Data Eng Weekly #318

21 July 2019

Several great posts this week, including some advice on operating distributed systems and a look a deduplicating data at scale at MixPanel. There are also several new tools in this week's issue—Ballista is a query engine written in Rust, LinkedIn has open sourced Brooklin, and OctoSQL is a new tool for querying across data sources.


Presto writes about implementing dynamic filtering in which they evaluate predicates when parsing files. By reading much less data, they see significant speed ups for joins that require shuffling data over the network.

WePay has a second post about Change Data Capture (CDC) for Apache Cassandra, in which they cover the architecture of the CDC agent, loading data from Kafka to BigQuery, and the transformations they do within BigQuery (they use views which are periodically materialized). Pretty interesting details about how they bootstrap the agent with a full table scan and the details of the commit log/queue processors. There's also some good advice on their philosophy for shipping a solution quickly by keeping scope small.

An engineer on the payment systems team at Uber shares several best practices for working with distributed systems. The post covers a lot of ground, including monitoring, alerting, outages, postmortems, failover drills, and much more. The post full of advice, such as valuing investment in keeping a system reliable as it scales up.

The BRIN, or Block Range INdex, is a new (introduced in version 9.5) PostgreSQL index type. BRIN stores summary information about blocks of pages, which can be used to filter out pages at query time. It takes up less space than a B-Tree index, and has some other interesting characteristics. The Percona blog goes through this and other details

Ballista is a POC-stage data querying engine written in Rust with Kubernetes as the distributed orchestration layer. Applications can be written in SQL or using a dataframe-like API. Ballista is written by the same author as DataFusion, the Rust implementation of Apache Arrow.

LinkedIn has open sourced Brooklin, their tool for streaming data between systems. Brooklin is a multi-tenant, dedicated service with dynamic provisioning/management (via REST endpoints). At LinkedIn, it's used to stream data between streaming systems (e.g. Kafka and Azure EventHubs as well as Kafka to Kafka), change data capture, and more. It currently supports MySQL, Cosmos DB, and Azure SQL as data sources as well as Kinesis, Cosmos DB, and Couchbase as destinations. The intro post has several more details.

Pinterest writes about their deployment of Presto. It's a rather large deployment—they have 14k CPU cores with 100 TBs of RAM across their fleet, and they service over 400,000 queries per month. They've built a ton of interesting analytics and monitoring about their clusters, such as measuring cluster stability by analyzing the number of started but not completed queries. They also talk about challenges—like deeply nested data structures, straggling workers, and more.

When they don't receive a 200 OK response, MixPanel's client applications retry the submission of analytics data, which can result in duplicates. At their scale, maintaining a global index of all event ids is expensive—so MixPanel has come up with a neat solution where they partition data and use in-memory indexes per partition to dedup.

OctoSQL is a tool for querying multiple data sources—whether remote DBs or local files. It seems like it could be useful for joining a one-off dataset with some data in your main SQL data. OctoSQL is written in golang, and configuration of datasources is via a simple yaml configuration format.


Curated by Datadog ( )


All about Streaming: Monitor Kafka Like a Pro + Apache Pulsar (San Francisco) - Wednesday, July 24

Bay Area K8s Meetup (San Jose) - Thursday, July 25


Kafka Streams and GraphQL (Columbus) - Monday, July 22


Streaming with Kafka + Introduction to KSQL (Atlanta) - Thursday, July 25

North Carolina

Everything You Wanted to Know about Apache Kafka but Were Too Afraid to Ask! (Durham) - Monday, July 22

Learn about Apache Kafka (Charlotte) - Tuesday, July 23


Kafka Top Ten Configurations (Columbia) - Thursday, July 25


Challenges That Everyone Struggles with While Productionizing Apache Spark (Toronto) - Wednesday, July 24


Apache Beam Meetup 3: Beam Portability + Visual Pipeline Development + ML w/ Beam (Stockholm) - Thursday, July 25


Streaming Your Events with Kafka Streams & KSQL (Lisboa) - Wednesday, July 24


Data Fundamentals for Growing Companies: Culture & Architecture (Tel Aviv-Yafo) - Monday, July 22


Melbourne Data Engineering Meetup (Docklands) - Wednesday, July 24

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.