Data Eng Weekly #324
In this week's issue, Robinhood and Zoomcar share their data infrastructure, and we learn about WePay's distributed write-ahead log (newly open sourced). There are also great articles on database tuning, the new garbage collectors in Java 11, testing distributed systems, and more.
Squarespace writes about how they drastically improved the performance of the MySQL deployment backing their TLS infrastructure (p95 response latency went from 200ms to 50ms). The post covers the architectural changes (making better use of hot read replicas, offloading unnecessary tasks) and the tuning (connection pools, better indexes) that they made.
An article on the importance of testing and formal verification in distributed systems. The author also argues that functional programming and static typing can reduce the amount of testing needed and make formal verification easier.
https://blog.colinbreck.com/on-eliminating-error-in-distributed-software-systems/
Cloudera shares some benchmarking of the G1GC, ZGC, and CMS Java 11 garbage collectors with HBase. They use the Yahoo Cloud Serving Benchmark to evaluate performance and to arrive at improved settings for the HBase workload.
https://blog.cloudera.com/cdh6-3-hbase-g1-gc-tuning-with-jdk11/
Zoomcar's data platform ingests data from a number of sources (mobility products as well as customer apps). They write about how the platform has evolved from analytics on a MySQL replica to a full-blown data platform with data in Kafka and S3. The post covers a lot of topics, such as how they ingest data from relational databases (plus schemas) and their clickstream.
WePay has open sourced Waltz, which is a distributed write-ahead log. They use Waltz as the primary store for transactions, and they materialize views of the data to the database for each service. Waltz has a lot of features for serializability, which are described (along with the architecture) in this post. Waltz uses ZooKeeper for cluster management, and it has separate server and storage nodes.
https://wecode.wepay.com/posts/waltz-a-distributed-write-ahead-log
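To make the "log as primary store, database as materialized view" idea concrete, here is a minimal in-memory sketch. It is not Waltz's API: `ToyLog`, `BalanceView`, and the `(account, delta)` entry format are all hypothetical names invented for illustration, and a real write-ahead log would be durable and distributed.

```python
class ToyLog:
    """In-memory append-only log; a hypothetical stand-in for a durable write-ahead log."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        """Append an entry and return its offset in the log."""
        self._entries.append(entry)
        return len(self._entries) - 1

    def read_from(self, offset):
        """Return (offset, entry) pairs starting at the given offset."""
        return list(enumerate(self._entries))[offset:]


class BalanceView:
    """A service-local view derived purely by replaying the log."""

    def __init__(self):
        self.balances = {}
        self.next_offset = 0  # first log offset not yet applied

    def catch_up(self, log):
        """Apply every log entry that this view has not seen yet."""
        for offset, (account, delta) in log.read_from(self.next_offset):
            self.balances[account] = self.balances.get(account, 0) + delta
            self.next_offset = offset + 1


log = ToyLog()
log.append(("alice", 100))
log.append(("bob", 50))

view = BalanceView()
view.catch_up(log)          # replays offsets 0 and 1

log.append(("alice", -30))
view.catch_up(log)          # replays only offset 2
```

Because the view tracks the next offset to apply, it can be rebuilt from scratch or resumed after a restart by replaying the log, which is the property that lets the log act as the source of truth.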
Gojek shares some tips for configuring and tuning the Kafka Producer.
https://blog.gojekengineering.com/how-to-unlock-the-full-potential-of-kafka-producers-e1a6877e2167
Robinhood writes about the infrastructure powering their data lake, which processes over 10TB/day and houses over 4PB of data. They ingest data from Kafka, storing it in S3 for batch processing with Apache Spark, AWS Athena/Presto, and Redshift. Workflows are coordinated with Apache Airflow, and they use Looker for BI.
https://robinhood.engineering/data-lake-at-robinhood-3e9cdf963368
This post introduces SQL ROLLUP, which computes aggregates at multiple levels of a grouping (when your GROUP BY has multiple columns). It also looks at the CUBE keyword, which computes aggregates for even more combinations of the grouping columns.
https://dev.to/griffinator76/rollup-like-a-boss-3dkl
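As a rough illustration of what ROLLUP and CUBE compute, the sketch below emulates both in plain Python rather than SQL. The sample rows and the helper names (`aggregate`, `rollup`, `cube`) are made up for this example, and `None` stands in for the NULL that SQL places in a rolled-up column.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales rows: (region, product, amount).
ROWS = [
    ("east", "apples", 10),
    ("east", "pears", 5),
    ("west", "apples", 7),
]


def aggregate(rows, keep):
    """Sum amounts grouped by the column positions in `keep`.

    Columns not in `keep` are replaced by None in the result key.
    """
    totals = defaultdict(int)
    for row in rows:
        *dims, amount = row
        key = tuple(d if i in keep else None for i, d in enumerate(dims))
        totals[key] += amount
    return dict(totals)


def rollup(rows, n_dims):
    """Emulate GROUP BY ROLLUP(a, b): group by (a, b), then (a), then ()."""
    result = {}
    for level in range(n_dims, -1, -1):
        result.update(aggregate(rows, set(range(level))))
    return result


def cube(rows, n_dims):
    """Emulate GROUP BY CUBE(a, b): group by every subset of the columns."""
    result = {}
    for size in range(n_dims, -1, -1):
        for keep in combinations(range(n_dims), size):
            result.update(aggregate(rows, set(keep)))
    return result


rollup_result = rollup(ROWS, 2)
cube_result = cube(ROWS, 2)

for key, total in rollup_result.items():
    print(key, total)
```

With two grouping columns, the rollup yields per-(region, product) totals, per-region subtotals, and a grand total. The cube additionally yields per-product subtotals across all regions.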
`fselect` is a handy CLI tool that presents a SQL-like query language for searching the file system (similar to *nix `find`). It also supports outputting results as JSON in addition to delimited text.
https://cli.fan/posts/fselect/
Events
Curated by Datadog
California
Apache Kafka Data Durability (San Jose) - Thursday, September 19
https://www.meetup.com/BayLISA/events/264201177/
Washington
Evolving Data Technologies: Survey of Data Technology Trends (Bellevue) - Wednesday, September 18
https://www.meetup.com/Big-Data-Bellevue-BDB/events/262650432/
Colorado
Real-Time Analytics with Apache Druid at Fullcontact (Denver) - Tuesday, September 17
https://www.meetup.com/Denver-Apache-Druid-Meetup-by-Imply/events/264007236/
Wisconsin
Event-Driven Architecture with Kafka and Containers (Brookfield) - Wednesday, September 18
https://www.meetup.com/Techmke/events/264255810/
District of Columbia
Survey of Real-Time Data Platforms: Cassandra, Spark, Akka, Kafka, Etc. (Washington) - Thursday, September 19
https://www.meetup.com/Cassandra-DataStax-DC/events/264344711/
CANADA
Apache Kafka for the Enterprise: IBM Event Streams (Toronto) - Monday, September 16
https://www.meetup.com/IBM-Cloud-Toronto/events/264134916/
IRELAND
Spark Meetup: Real-Time Edition (Dublin) - Thursday, September 19
https://www.meetup.com/Dublin-Spark-Meetup/events/264285037/
UNITED KINGDOM
Parquet Optimisations + Building Spark Data Pipelines (London) - Wednesday, September 18
https://www.meetup.com/Spark-London/events/264184808/
Building Stream Processing Applications with Apache Kafka Using KSQL (Manchester) - Thursday, September 19
https://www.meetup.com/Manchester-Kafka/events/263968143/
FINLAND
Helsinki Apache Kafka Meetup (Helsinki) - Monday, September 16
https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/263025896/
SPAIN
Kafka Streams and the Tide of Data (Barcelona) - Wednesday, September 18
https://www.meetup.com/Meetup-de-Big-Data-de-datahack-en-Barcelona/events/264490794/
FRANCE
FinistDevs: Apache Flink & WebAssembly (Le Relecq-Kerhuon) - Thursday, September 19
https://www.meetup.com/FinistDevs/events/264598116/
GERMANY
Building Stream Processing Applications with Apache Kafka Using KSQL (Dortmund) - Tuesday, September 17
https://www.meetup.com/Dortmund-Kafka/events/263555199/
On Track with Apache Kafka: Building a Streaming ETL Solution with Rail Data (Eschborn) - Wednesday, September 18
https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/263895041/
Orchestrate Kafka on Kubernetes + Kafka @DATEV (Nuremberg) - Wednesday, September 18
https://www.meetup.com/Nurnberg-Kafka/events/264136545/
ITALY
Building Stream Processing Applications with Apache Kafka Using KSQL (Rome) - Monday, September 16
https://www.meetup.com/Roma-Kafka-meetup-group/events/263968301/
POLAND
Dissolving the Problem: Kafka Is More ACID Than Your Database (Gdansk) - Monday, September 16
https://www.meetup.com/Gdansk-Kafka/events/264429138/
BULGARIA
Riding Endless Streams with Kafka (Sofia) - Thursday, September 19
https://www.meetup.com/Leanplum-Tech-Talks-Sofia/events/264368211/
AUSTRALIA
Melbourne Data Engineering Meetup (Melbourne) - Thursday, September 19
https://www.meetup.com/Melbourne-Data-Engineering-Meetup/events/262799936/
Sydney Data Engineering Meetup (Surry Hills) - Thursday, September 19
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/262330474/
NEW ZEALAND
NZ Data Engineering Meetup #1: Snowflake and Your Data Lake (Auckland) - Thursday, September 19
https://www.meetup.com/New-Zealand-Data-Engineering-Meetup/events/263637937/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.