Data Eng Weekly #324

In this week's issue, Robinhood and Zoomcar share their data infrastructure, and we learn about WePay's distributed write-ahead log (newly open sourced). There are also great articles on database tuning, the new garbage collectors in Java 11, testing distributed systems, and more.

Squarespace writes about how they drastically improved the performance of the MySQL deployment backing their TLS infrastructure (p95 response latency dropped from 200ms to 50ms). The post covers the architectural changes (making better use of hot read replicas, offloading unnecessary tasks) and the tuning (connection pools, better indexes) that they made.

An article on the importance of testing and formal verification in distributed systems. The author also argues that functional programming and static typing can help narrow the amount of testing and make formal verification easier.

Cloudera shares benchmarks of the G1GC, ZGC, and CMS garbage collectors running HBase on Java 11. They use the Yahoo Cloud Serving Benchmark to evaluate performance and to find improved settings for the HBase workload.

Zoomcar's data platform ingests data from a number of sources (mobility products as well as customer apps). They write about how the platform has evolved from analytics on a MySQL replica to a full-blown data platform with data in Kafka and S3. The post covers a lot of topics, such as how they ingest data from relational databases (plus schemas) and their clickstream.

WePay has open sourced Waltz, a distributed write-ahead log. They use Waltz as the primary store for transactions, and they materialize views of the data to each service's database. Waltz has a lot of features for serializability, which are described (along with the architecture) in this post. Waltz uses ZooKeeper for cluster management, and it has separate server and storage nodes.

Gojek shares some tips for configuring and tuning the Kafka Producer.
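The post's specific recommendations aren't reproduced here, but producer tuning usually revolves around a handful of standard configuration keys from the Java client. A rough sketch (the values are illustrative, not Gojek's):

```python
# Commonly tuned Kafka producer settings, keyed by the Java client's
# configuration names. Values here are illustrative defaults, not
# recommendations from the post.
producer_config = {
    "acks": "all",               # wait for all in-sync replicas (durability over latency)
    "linger.ms": 20,             # wait up to 20ms to batch records together
    "batch.size": 65536,         # max bytes per partition batch
    "compression.type": "lz4",   # compress batches on the wire
    "retries": 2147483647,       # retry transient send failures
    "enable.idempotence": True,  # avoid duplicates on retry
    "buffer.memory": 33554432,   # total memory for unsent records
}
```

Raising linger.ms and batch.size trades a little latency for throughput; acks=all plus idempotence trades latency for durability.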

Robinhood writes about the infrastructure powering their data lake, which processes over 10TB/day and houses over 4PB of data. They ingest data from Kafka, storing it in S3 for batch processing with Apache Spark, AWS Athena/Presto, and Redshift. Workflows are coordinated with Apache Airflow, and they use Looker for BI.

This post provides an introduction to SQL ROLLUP, which provides a mechanism to compute aggregates at multiple levels of a grouping (when your GROUP BY has multiple columns). It also looks at the CUBE keyword, which provides a mechanism for computing even more levels of aggregates.
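As a rough illustration of what ROLLUP computes, here's a plain-Python sketch (the data and the sum aggregate are made up for the example). GROUP BY ROLLUP(year, quarter) produces totals per (year, quarter), per year, and a grand total, with NULL standing in for the rolled-up columns:

```python
rows = [
    ("2019", "Q1", 10), ("2019", "Q2", 20),
    ("2020", "Q1", 30), ("2020", "Q2", 40),
]

def rollup_sum(rows):
    """Mimic GROUP BY ROLLUP(year, quarter): totals per (year, quarter),
    per year, and a grand total; None stands in for SQL's NULL."""
    totals = {}
    for year, quarter, value in rows:
        for key in [(year, quarter), (year, None), (None, None)]:
            totals[key] = totals.get(key, 0) + value
    return totals

totals = rollup_sum(rows)
# totals[("2019", "Q1")] == 10, totals[("2019", None)] == 30,
# totals[(None, None)] == 100
```

CUBE would additionally compute the per-quarter level, that is, every combination of the grouping columns.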

`fselect` is a handy CLI tool that presents a SQL-like query language for searching the file system (similar to *nix `find`). It also supports outputting results as JSON in addition to delimited text.


Curated by Datadog


Apache Kafka Data Durability (San Jose) - Thursday, September 19


Evolving Data Technologies: Survey of Data Technology Trends (Bellevue) - Wednesday, September 18


Real-Time Analytics with Apache Druid at Fullcontact (Denver) - Tuesday, September 17


Event-Driven Architecture with Kafka and Containers (Brookfield) - Wednesday, September 18

District of Columbia

Survey of Real-Time Data Platforms: Cassandra, Spark, Akka, Kafka, Etc. (Washington) - Thursday, September 19


Apache Kafka for the Enterprise: IBM Event Streams (Toronto) - Monday, September 16


Spark Meetup: Real-Time Edition (Dublin) - Thursday, September 19


Parquet Optimisations + Building Spark Data Pipelines (London) - Wednesday, September 18

Building Stream Processing Applications with Apache Kafka Using KSQL (Manchester) - Thursday, September 19


Helsinki Apache Kafka Meetup (Helsinki) - Monday, September 16


Kafka Streams and the Tide of Data (Barcelona) - Wednesday, September 18


FinistDevs: Apache Flink & WebAssembly (Le Relecq-Kerhuon) - Thursday, September 19


Building Stream Processing Applications with Apache Kafka Using KSQL (Dortmund) - Tuesday, September 17

On Track with Apache Kafka: Building a Streaming ETL Solution with Rail Data (Eschborn) - Wednesday, September 18

Orchestrate Kafka on Kubernetes + Kafka @DATEV (Nuremberg) - Wednesday, September 18


Building Stream Processing Applications with Apache Kafka Using KSQL (Rome) - Monday, September 16


Dissolving the Problem: Kafka Is More ACID Than Your Database (Gdansk) - Monday, September 16


Riding Endless Streams with Kafka (Sofia) - Thursday, September 19


Melbourne Data Engineering Meetup (Melbourne) - Thursday, September 19

Sydney Data Engineering Meetup (Surry Hills) - Thursday, September 19


NZ Data Engineering Meetup #1: Snowflake and Your Data Lake (Auckland) - Thursday, September 19

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly #323

If you're interested in microservices, asynchronous programming models, event sourcing, stream processing, or statically typed languages, then there should be something in this week's issue for you. Lots of coverage, from some big thinking (about ORMs and whether they're approaching the problem incorrectly) to practical advice about how to get the most out of a review of analytics code.

This article is quite the introduction to working with Kafka Streams from Clojure, covering the main Kafka Streams API as well as the Willa library for writing idiomatic Clojure. The post also describes transducers in Clojure (a mechanism for building generic data transformations) and has a couple of useful examples.
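For readers who don't know Clojure, the transducer idea translates to any language with first-class functions: a transducer transforms a reducing function into another reducing function, independent of the input or output collection. A minimal Python rendition (illustrative only, not the article's code):

```python
from functools import reduce

def mapping(f):
    """Transducer: wraps a reducing function so inputs pass through f first."""
    def xform(reducer):
        return lambda acc, x: reducer(acc, f(x))
    return xform

def filtering(pred):
    """Transducer: wraps a reducing function so it only sees items passing pred."""
    def xform(reducer):
        return lambda acc, x: reducer(acc, x) if pred(x) else acc
    return xform

def compose(*xforms):
    """Compose transducers so the leftmost processes the input first,
    matching Clojure's (comp ...) convention."""
    def composed(reducer):
        for xf in reversed(xforms):
            reducer = xf(reducer)
        return reducer
    return composed

# Double the evens of 0..9 — the same transformation feeds either
# a list-building reducer or a summing reducer.
xf = compose(filtering(lambda x: x % 2 == 0), mapping(lambda x: x * 2))
as_list = reduce(xf(lambda acc, x: acc + [x]), range(10), [])
as_sum = reduce(xf(lambda acc, x: acc + x), range(10), 0)
# as_list == [0, 4, 8, 12, 16], as_sum == 40
```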

A good list of best practices in a microservices architecture. If you’re feeling pain with your deployments or architecture, then this post has lots of ideas for how to improve (like making sure you can upgrade a database schema without updating multiple services). The author notes that a half-baked microservices architecture can lead to failure scenarios that are more likely than they would be in a monolith.

When asking folks about ORMs, you're bound to get strong opinions. This piece describes why the main design goals of an ORM (i.e. modeling your data in your language/framework) can end up generating bad SQL queries. The post argues for another approach—starting with SQL and using that to generate your ORM.

The morning paper writes about SLOG, a new multi-region database system. Unlike similar systems (Google Spanner and Calvin), SLOG introduces the idea of a "home" region. It has two modes of operation—either synchronously replicating within a single home region or an HA mode that replicates across regions. By taking advantage of locality within a region, the system can improve throughput and latency.

A collection of good code review tips applied to analytics code (i.e. mostly SQL or Python for data munging). For example, there's a section on consistently naming your data models and fields as well as one on DRYing up a DAG or a CTE.

Futures are used quite often in distributed systems on the JVM (and also in node.js and other systems). This post provides a good basic introduction using an example of building a parallel web crawler.
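As a small illustration of the pattern, here's a sketch of the parallel-crawler idea using Python's concurrent.futures (the fetch function is stubbed out, so no real network calls are made):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for an HTTP fetch; a real crawler would issue a request here.
def fetch(url):
    return f"<html>contents of {url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(5)]

# Submit every fetch up front; each call returns a Future immediately,
# and as_completed yields futures as their results arrive.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch, u): u for u in urls}
    results = {futures[f]: f.result() for f in as_completed(futures)}
```

The key property futures give you is that submission and completion are decoupled: the pool overlaps the slow I/O-bound calls instead of running them one at a time.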

A good list of pitfalls and anti-patterns both for those getting started with and looking at expanding usage of Apache Cassandra. For example—it's important to know your query pattern before you model your data, and Cassandra shouldn't be used as a queue. The post covers seven potential mistakes in some detail and mentions a few others to keep an eye out for, too.

Jepsen has published a new analysis of YugaByte DB. In this post, they test the upcoming support for serializable transactions, and they find a few problems (including with DEFAULT values and anti-dependency cycles). As always, the Jepsen post has a good overview of YugaByte, the consistency model, the test design, and more.

This article describes Derivative Event Sourcing. For legacy or other applications that you can't change, you can derive and publish events to an event stream for consumption by downstream applications. Change Data Capture via a database is a common mechanism for implementing this, but you could also derive using application or other logs.
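A minimal sketch of the log-based variant: parse application log lines and publish matching ones as events. The log format and event names here are invented for illustration, not taken from the article:

```python
import re

# Hypothetical legacy log format; the regex and event names are illustrative.
LINE = re.compile(r"(?P<ts>\S+) user=(?P<user>\w+) action=(?P<action>\w+)")

def derive_events(log_lines):
    """Turn matching log lines into events suitable for publishing to a
    stream; lines that don't match are skipped."""
    for line in log_lines:
        m = LINE.match(line)
        if m:
            yield {"type": f"user.{m['action']}", "user": m["user"], "at": m["ts"]}

logs = [
    "2019-09-16T12:00:00Z user=ada action=signup",
    "malformed line",
    "2019-09-16T12:01:00Z user=ada action=login",
]
events = list(derive_events(logs))
# two events: user.signup and user.login for ada
```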

Since a lot of data infrastructure code tends to be written in Python (and also because I just found this post interesting!), I'm sharing Dropbox's post on how they rolled out type checking for their Python codebase. The post motivates why you might want static typing and it dives deep into the performance improvements the team made to scale to 5 million lines of code.
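As a tiny illustration of what the annotations buy you (this example is mine, not Dropbox's): a function returning an Optional forces callers, under a checker like mypy, to handle the None case before using the result.

```python
from typing import Dict, Optional

def lookup_account(accounts: Dict[str, int], name: str) -> Optional[int]:
    """Return the account id for name, or None if absent."""
    return accounts.get(name)

accounts = {"ada": 1, "grace": 2}
account_id = lookup_account(accounts, "ada")

# Using account_id + 1 directly would be flagged by mypy:
# Optional[int] is not int until the None case is ruled out.
if account_id is not None:
    total = account_id + 1
```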


Curated by Datadog


Free Apache Spark One-Day Hands-On Workshop (Santa Clara) - Sunday, September 15 


Cleveland Big Data Mega Meetup (Cleveland) - Monday, September 9

South Carolina

Beyond Stateless --> Stateful K8s with Do's and Don'ts (Greenville) - Thursday, September 12


Vancouver Spark Meetup @ Galvanize (Vancouver) - Thursday, September 12


Data Meetup 2019.2 (Sao Carlos) - Wednesday, September 11


Building a Streaming ETL Solution with Rail Data (Leeds) - Wednesday, September 11

Data Platform User Group: Cosmos DB, Spark (Leeds) - Thursday, September 12


Safe Event Processing + Kafka at Norsk Tipping (Oslo) - Tuesday, September 10


GOTO Night with Erik Dornenburg & Kresten Thorup (Hamburg) - Monday, September 9


Data Natives Vienna v 7.0 (Vienna) - Thursday, September 12


Journey of Two Streaming Frameworks: Spark Streaming and Kafka Streams (Tel Aviv-Yafo) - Sunday, September 15

Women in Big Data Answer Any Question about Their Jobs (Tel Aviv-Yafo) - Sunday, September 15 


Data Eng Weekly #322

Lots of content to read through this week—everything from tutorials (introduction to Elasticsearch) to debugging stories (from the folks at Gitlab) to performance improvements (a nice speedup for Presto and deployment architecture improvements for Flink) to some talks from Leslie Lamport on Paxos and TLA+. Also a couple of interesting new tools (PartiQL and dqlite) to check out!


The Alibaba blog has a look at how Flink Improvement Proposal 6 updates the deployment architecture to support standalone, YARN, and Kubernetes deployments. For YARN, the new architecture will support spinning up workers ahead of time to improve latency of jobs.

PartiQL is a new query language and runtime for relational and nested data (such as that stored in a Parquet file). The query language is SQL-compliant—it's in use at Amazon to query data across several data stores (like Amazon S3, Amazon Redshift Spectrum, and more). There's an open source project on GitHub, and the Couchbase project is looking to adopt it.

The Presto blog has an article about a new optimization for queries over nested/array data that require unnesting. By using a dictionary encoding, they can avoid materializing the nested data. This improves performance by as much as 9x while using even less CPU.

Leslie Lamport, who won the Turing Award for his work in distributed systems, has shared two lectures on distributed computing. He covers Paxos and TLA+ in the two videos, and he also shares slides/write-ups and follow-up exercises if you want to dive even deeper.

Amazon CTO Werner Vogels writes about how they build applications at Amazon—and, no surprise, distributed systems play a big role. He writes about their migration from a monolith to microservices, purpose-built databases, their operational model (focusing on serverless), and security.

dqlite, which extends sqlite to run in a distributed, fault tolerant setup, just hit version 1.0. It uses Raft for data replication and its own wire protocol.

Who doesn't love a complex distributed systems debug story with a happy ending? Gitlab writes about how they identified and resolved an issue with their SSHD fleet. They also share six lessons that they learned in the process.

The morning paper covers an article that looks at the tradeoffs between various OLAP database services on AWS. The authors benchmark using TPC-H, and based on these tests they suggest storing data in S3 using a columnar format. This provides good data portability—enabling Athena and Redshift Spectrum, two tools that can offer significant cost savings for sporadic queries.

It's always interesting when a small change has a big impact. In this case, adding three lines of code (switching from streaming inserts to low-frequency loads from Google Cloud Storage) resulted in a 95% drop in the cost of loading data into BigQuery.

This article is a great introduction to Elasticsearch. It has both breadth (covering some operational aspects, how to query data in elastic, and running Kibana) and depth (covering the details of how to efficiently load CSV data into elastic using the bulk insert API and refreshing indices afterwards). For those wanting to learn elastic—this is a great place to get started. And if you've used elastic before, there's a decent chance you'll learn something new.
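The bulk API the article covers takes a newline-delimited JSON body in which each document is preceded by an action line naming the target index. A sketch of building such a payload (the index name and documents are made up):

```python
import json

def build_bulk_body(index, docs):
    """Build an NDJSON body for Elasticsearch's _bulk endpoint: each
    document line is preceded by an action line naming the target index."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

docs = [{"title": "first"}, {"title": "second"}]
body = build_bulk_body("articles", docs)
```

You'd POST this body to /_bulk with Content-Type: application/x-ndjson, then refresh the index to make the documents searchable, which is the flow the article walks through.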


Curated by Datadog


Data Engineering Meetup (San Diego) - Thursday, September 5


Kafka as a Platform: The Ecosystem from the Ground Up (Greenwood Village) - Tuesday, September 3

District of Columbia

DC Data Engineering 1: Fast! Big! Distributed! - Tuesday, September 3

Stream Processing with the Spring Framework, Like You've Never Seen It Before (Washington) - Thursday, September 5 


Making Apache Spark Better with Delta Lake (Toronto) - Thursday, September 5


Apache Kafka and KSQL in Action! Let’s Build a Streaming Data Pipeline! (Dublin) - Thursday, September 5


Self-Service Data Platforms with Spark, Kafka, and Avro (Barcelona) - Thursday, September 5


Apache Spark Meetup @ AWS (Courbevoie) - Thursday, September 5


Apache Kafka: Reaching the Castle (Herzliya) - Tuesday, September 3


"Kafka Day" with Neha Narkhede, Kafka Co-creator et al. (Bengaluru) - Sunday, September 8


A Small Introduction to Kafka and Using It as a Data Platform in Production (Melbourne) - Wednesday, September 4


Data Eng Weekly #321

25 August 2019

If you're reading this, then the transition to the new email service is a success! You should expect the same great content with a slightly new look.

As for this week's issue, there's coverage of some tools (DBT, Debezium for MySQL), distributed systems architecture (the Databricks Delta Lake transaction log, Timescale's distributed time series DB, and an overview of consistency and isolation levels), and posts on RocksDB and Twitter's new open source telemetry agent. There should be something good for everyone!


This tutorial describes how to enable the MySQL binary log for streaming change data capture (i.e. producing a record for each insert, update, delete into a MySQL table) to Apache Kafka using Debezium.
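For context, Debezium emits one record per row change, carrying before/after images of the row and an operation code. A simplified sketch of the envelope and how a consumer might replay it (real events also carry schema and source metadata):

```python
# Simplified shape of a Debezium change event for an UPDATE on a MySQL
# row. "op" is "c" (create), "u" (update), or "d" (delete).
change_event = {
    "op": "u",
    "before": {"id": 42, "email": "old@example.com"},
    "after": {"id": 42, "email": "new@example.com"},
    "ts_ms": 1568600000000,
}

def apply_change(table, event):
    """Replay a change event against an in-memory copy of the table,
    keyed by primary key."""
    if event["op"] == "d":
        table.pop(event["before"]["id"], None)
    else:
        row = event["after"]
        table[row["id"]] = row
    return table

state = apply_change({42: {"id": 42, "email": "old@example.com"}}, change_event)
# state[42]["email"] == "new@example.com"
```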

Klarna writes about Diftong, their tool for validating changes to workflows by comparing the data sets produced before and after a change. Diftong is a general-purpose tool that works with any two data sets sharing a schema, applying techniques like deduplicating data and calculating row- and column-level statistics. There's a full paper on the tool and how it's used at Klarna if you want to go even deeper.

The Delta Lake framework maintains a transaction log alongside a data set to provide atomicity. The transaction log is stored as JSON with each file representing a commit. This post dives into the details of this implementation, including optimizations using checkpoints, optimistic concurrency control, and handling of conflicts.
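The replay logic at the heart of this design can be sketched in a few lines: the table's current contents are the data files added by some commit and not removed by a later one. (The commits below are inlined and simplified to add/remove actions; in Delta Lake they live as numbered JSON files under _delta_log/.)

```python
# Each commit is a list of actions, applied in commit order.
commits = [
    [{"add": {"path": "part-0.parquet"}}, {"add": {"path": "part-1.parquet"}}],
    [{"remove": {"path": "part-0.parquet"}}, {"add": {"path": "part-2.parquet"}}],
]

def current_files(commits):
    """Replay commits in order: the table consists of the data files
    that were added and not subsequently removed."""
    files = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# current_files(commits) == {"part-1.parquet", "part-2.parquet"}
```

The checkpoint optimization the post describes amounts to snapshotting this replayed state periodically so readers don't have to start from commit zero.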

Timescale writes about their distributed time series database built on PostgreSQL, which is under development and in private beta. The post describes how they use "chunking" rather than "sharding" to distribute data across nodes in the cluster, presents the high level architecture (access and data nodes), and demonstrates how the system handles inserts and queries.

The Dremio blog covers a relatively new feature of Apache Arrow, the Flight data transfer protocol. Flight is built on gRPC and aims to saturate networks while also having low CPU overhead by using the Arrow in-memory data representation (i.e. no deserialization or serialization).

Rezolus is a new open source telemetry agent from Twitter. It's written in Rust, and it implements sophisticated data collection and sampling in order to detect short (e.g. <10 seconds) anomalous events.

Rockset writes about how they improved performance of bulk loading data into RocksDB. They parallelize writes, optimize compactions, and more. Overall, they get a 20x speedup over the original approach.

The Telegraph Engineering blog writes about dbt, the data build tool, for building data transformations. It describes the major functionality of dbt, like its UI for viewing data sources and models, its framework for writing templated queries, and its functionality for building data-check tests (e.g. guaranteeing that values are unique or that a column is never null in a data set).

This post provides an overview of both isolation levels and consistency levels, and it describes why in many cases you need guarantees for both. We often use terms that describe both an isolation level and a consistency level, so it's all a bit complicated. But it's definitely worth understanding these terms if you're working with data systems that throw them around!


Curated by Datadog


Apache Heron Hands-On (Sunnyvale) - Monday, August 26

Apache Druid and YuniKorn: Universal Resource Scheduler for Both K8s and Yarn (San Francisco) - Wednesday, August 28


Kafka Streams on Kubernetes with RocksDB & Ktables Plus Avro! (Scottsdale) - Tuesday, August 27


Kicking Your Database to the Curb (Reston) - Tuesday, August 27


Apache Spark on Kubernetes (Frankfurt) - Tuesday, August 27


Apache Kafka Meetup at Swiss Re (Zurich) - Monday, August 26


Helsinki Apache Kafka Meetup (Helsinki) - Tuesday, August 27


Data Engineering Melbourne Meetup (Melbourne) - Thursday, August 29


Data Eng Weekly #318

21 July 2019

Several great posts this week, including some advice on operating distributed systems and a look at deduplicating data at scale at MixPanel. There are also several new tools in this week's issue—Ballista is a query engine written in Rust, LinkedIn has open sourced Brooklin, and OctoSQL is a new tool for querying across data sources.


Presto writes about implementing dynamic filtering, in which they evaluate join predicates while reading files. By reading much less data, they see significant speedups for joins that require shuffling data over the network.

WePay has a second post about Change Data Capture (CDC) for Apache Cassandra, in which they cover the architecture of the CDC agent, loading data from Kafka to BigQuery, and the transformations they do within BigQuery (they use views which are periodically materialized). Pretty interesting details about how they bootstrap the agent with a full table scan and the details of the commit log/queue processors. There's also some good advice on their philosophy for shipping a solution quickly by keeping scope small.

An engineer on the payment systems team at Uber shares several best practices for working with distributed systems. The post covers a lot of ground, including monitoring, alerting, outages, postmortems, failover drills, and much more. The post is full of advice, such as valuing investment in keeping a system reliable as it scales up.

The BRIN, or Block Range INdex, is a new (introduced in version 9.5) PostgreSQL index type. BRIN stores summary information about blocks of pages, which can be used to filter out pages at query time. It takes up less space than a B-Tree index and has some other interesting characteristics. The Percona blog goes through these and other details.
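The core idea is easy to sketch: summarize each block of rows by the min/max of the indexed column, then at query time skip any block whose range can't contain the value. A toy model (not Postgres's implementation):

```python
def build_brin(values, block_size):
    """Summarize each block of rows by the (min, max) of the indexed column."""
    return [
        (min(block), max(block))
        for block in (values[i:i + block_size]
                      for i in range(0, len(values), block_size))
    ]

def candidate_blocks(summaries, needle):
    """Blocks whose [min, max] range could contain needle; every other
    block is skipped without being read."""
    return [i for i, (lo, hi) in enumerate(summaries) if lo <= needle <= hi]

# Mostly-ordered data (e.g. an append-only timestamp column) keeps block
# ranges narrow, which is when BRIN shines.
values = list(range(100))
summaries = build_brin(values, block_size=10)
# candidate_blocks(summaries, 42) == [4]
```

The summaries are tiny relative to a B-Tree because there is one entry per block rather than per row, which is where the space savings come from.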

Ballista is a POC-stage data querying engine written in Rust with Kubernetes as the distributed orchestration layer. Applications can be written in SQL or using a dataframe-like API. Ballista is written by the same author as DataFusion, the Rust implementation of Apache Arrow.

LinkedIn has open sourced Brooklin, their tool for streaming data between systems. Brooklin is a multi-tenant, dedicated service with dynamic provisioning/management (via REST endpoints). At LinkedIn, it's used to stream data between streaming systems (e.g. Kafka and Azure EventHubs as well as Kafka to Kafka), change data capture, and more. It currently supports MySQL, Cosmos DB, and Azure SQL as data sources as well as Kinesis, Cosmos DB, and Couchbase as destinations. The intro post has several more details.

Pinterest writes about their deployment of Presto. It's a rather large deployment—they have 14k CPU cores with 100TB of RAM across their fleet, and they serve over 400,000 queries per month. They've built a ton of interesting analytics and monitoring for their clusters, such as measuring cluster stability by analyzing the number of started-but-not-completed queries. They also talk about challenges—like deeply nested data structures, straggling workers, and more.

When they don't receive a 200 OK response, MixPanel's client applications retry the submission of analytics data, which can result in duplicates. At their scale, maintaining a global index of all event ids is expensive—so MixPanel has come up with a neat solution where they partition data and use in-memory indexes per partition to dedup.
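The scheme can be sketched as: route each event id to a partition, and keep a seen-set per partition so no node ever needs a global index. A toy model of the approach, with the details invented:

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(event_id):
    """Route an event id to a partition; each partition only needs an
    index of its own ids, never a global one."""
    digest = hashlib.sha1(event_id.encode()).digest()
    return digest[0] % NUM_PARTITIONS

seen = [set() for _ in range(NUM_PARTITIONS)]

def accept(event_id):
    """Return True the first time an id is seen, False for retried duplicates."""
    p = partition_for(event_id)
    if event_id in seen[p]:
        return False
    seen[p].add(event_id)
    return True
```

Because an id always hashes to the same partition, a client retry lands on the same seen-set and is rejected, without any coordination across partitions.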

OctoSQL is a tool for querying multiple data sources—whether remote databases or local files. It seems like it could be useful for joining a one-off dataset with data in your main SQL database. OctoSQL is written in Go, and data sources are configured via a simple YAML format.


Curated by Datadog


All about Streaming: Monitor Kafka Like a Pro + Apache Pulsar (San Francisco) - Wednesday, July 24

Bay Area K8s Meetup (San Jose) - Thursday, July 25


Kafka Streams and GraphQL (Columbus) - Monday, July 22


Streaming with Kafka + Introduction to KSQL (Atlanta) - Thursday, July 25

North Carolina

Everything You Wanted to Know about Apache Kafka but Were Too Afraid to Ask! (Durham) - Monday, July 22

Learn about Apache Kafka (Charlotte) - Tuesday, July 23


Kafka Top Ten Configurations (Columbia) - Thursday, July 25


Challenges That Everyone Struggles with While Productionizing Apache Spark (Toronto) - Wednesday, July 24


Apache Beam Meetup 3: Beam Portability + Visual Pipeline Development + ML w/ Beam (Stockholm) - Thursday, July 25


Streaming Your Events with Kafka Streams & KSQL (Lisboa) - Wednesday, July 24


Data Fundamentals for Growing Companies: Culture & Architecture (Tel Aviv-Yafo) - Monday, July 22


Melbourne Data Engineering Meetup (Docklands) - Wednesday, July 24

