Data Eng Weekly #324

In this week's issue, Robinhood and Zoomcar share their data infrastructure, and we learn about WePay's distributed write-ahead log (newly open sourced). There are also great articles on database tuning, the new garbage collectors in Java 11, testing distributed systems, and more.

Squarespace writes about how they drastically improved the performance of the MySQL deployment backing their TLS infrastructure (p95 response latency dropped from 200ms to 50ms). The post covers the architectural changes (making better use of hot read replicas, offloading unnecessary tasks) and the tuning (connection pools, better indexes) that they made.

An article on the importance of testing and formal verification in distributed systems. The author also argues that functional programming and static typing can reduce the amount of testing needed and make formal verification easier.

Cloudera shares some benchmarking of the G1GC, ZGC, and CMS garbage collectors in Java 11 with HBase. They use the Yahoo Cloud Serving Benchmark to evaluate performance and to identify improved settings for the HBase workload.

Zoomcar's data platform ingests data from a number of sources (mobility products as well as customer apps). They write about how the platform has evolved from analytics on a MySQL replica to a full-blown data platform with data in Kafka and S3. The post covers a lot of topics, such as how they ingest data from relational databases (plus schemas) and their clickstream.

WePay has open sourced Waltz, which is a distributed write-ahead log. They use Waltz as the primary store for transactions, and they materialize views of the data to the database for each service. Waltz has a lot of features for serializability, which are described (along with the architecture) in this post. Waltz uses ZooKeeper for cluster management, and it has separate server and storage nodes.

Gojek shares some tips for configuring and tuning the Kafka Producer.

Robinhood writes about the infrastructure powering their data lake, which processes over 10TB/day and houses over 4PB of data. They ingest data from Kafka, storing it in S3 for batch processing with Apache Spark, AWS Athena/Presto, and Redshift. Workflows are coordinated with Apache Airflow, and they use Looker for BI.

This post introduces SQL ROLLUP, which computes aggregates at multiple levels of a grouping (when your GROUP BY has multiple columns). It also looks at the CUBE keyword, which computes aggregates for even more combinations of the grouping columns.
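
To make the difference concrete, here is a minimal Python sketch (not taken from the linked post) that emulates what `GROUP BY ROLLUP(region, product)` and `GROUP BY CUBE(region, product)` would return for a `SUM`. The `SALES` rows and the `rollup`/`cube` helper names are hypothetical, and `None` stands in for the NULL that SQL puts in a rolled-up column.

```python
from collections import defaultdict

# Hypothetical sample rows: (region, product, amount)
SALES = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
    ("west", "widget", 3),
]


def rollup(rows, n_keys):
    """Sum the last column for every prefix of the first n_keys columns."""
    totals = defaultdict(int)
    for row in rows:
        keys, amount = row[:n_keys], row[-1]
        # ROLLUP keeps prefixes only: (region, product), (region,), ()
        for level in range(n_keys, -1, -1):
            group = keys[:level] + (None,) * (n_keys - level)
            totals[group] += amount
    return dict(totals)


def cube(rows, n_keys):
    """Sum the last column for every subset of the first n_keys columns."""
    totals = defaultdict(int)
    for row in rows:
        keys, amount = row[:n_keys], row[-1]
        # CUBE keeps all 2**n_keys subsets; a set bit keeps that column
        for mask in range(2 ** n_keys):
            group = tuple(
                key if mask & (1 << i) else None for i, key in enumerate(keys)
            )
            totals[group] += amount
    return dict(totals)


if __name__ == "__main__":
    for group, total in rollup(SALES, 2).items():
        print("ROLLUP", group, total)
    for group, total in cube(SALES, 2).items():
        print("CUBE  ", group, total)
```

With two grouping columns, ROLLUP adds per-region subtotals and a grand total, while CUBE additionally adds per-product subtotals. In a real database, the `GROUPING()` function is the usual way to tell a rolled-up NULL from a NULL that was present in the data, which this sketch does not attempt.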

`fselect` is a handy CLI tool that presents a SQL-like query language for searching the file system (similar to *nix `find`). It also supports outputting results as JSON in addition to delimited text.


Curated by Datadog


Apache Kafka Data Durability (San Jose) - Thursday, September 19


Evolving Data Technologies: Survey of Data Technology Trends (Bellevue) - Wednesday, September 18


Real-Time Analytics with Apache Druid at Fullcontact (Denver) - Tuesday, September 17


Event-Driven Architecture with Kafka and Containers (Brookfield) - Wednesday, September 18

District of Columbia

Survey of Real-Time Data Platforms: Cassandra, Spark, Akka, Kafka, Etc. (Washington) - Thursday, September 19


Apache Kafka for the Enterprise: IBM Event Streams (Toronto) - Monday, September 16


Spark Meetup: Real-Time Edition (Dublin) - Thursday, September 19


Parquet Optimisations + Building Spark Data Pipelines (London) - Wednesday, September 18

Building Stream Processing Applications with Apache Kafka Using KSQL (Manchester) - Thursday, September 19


Helsinki Apache Kafka Meetup (Helsinki) - Monday, September 16


Kafka Streams and the Tide of Data (Barcelona) - Wednesday, September 18


FinistDevs: Apache Flink & WebAssembly (Le Relecq-Kerhuon) - Thursday, September 19


Building Stream Processing Applications with Apache Kafka Using KSQL (Dortmund) - Tuesday, September 17

On Track with Apache Kafka: Building a Streaming ETL Solution with Rail Data (Eschborn) - Wednesday, September 18

Orchestrate Kafka on Kubernetes + Kafka @DATEV (Nuremberg) - Wednesday, September 18


Building Stream Processing Applications with Apache Kafka Using KSQL (Rome) - Monday, September 16


Dissolving the Problem: Kafka Is More ACID Than Your Database (Gdansk) - Monday, September 16


Riding Endless Streams with Kafka (Sofia) - Thursday, September 19


Melbourne Data Engineering Meetup (Melbourne) - Thursday, September 19

Sydney Data Engineering Meetup (Surry Hills) - Thursday, September 19


NZ Data Engineering Meetup #1: Snowflake and Your Data Lake (Auckland) - Thursday, September 19

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.