In this week's issue, Robinhood and Zoomcar share their data infrastructure, and we learn about WePay's distributed write-ahead log (newly open sourced). There are also great articles on database tuning, the new garbage collectors in Java 11, testing distributed systems, and more.
Squarespace writes about how they drastically improved the performance of the MySQL deployment backing their TLS infrastructure (p95 response latency dropped from 200ms to 50ms). The post covers both the architectural changes (making better use of hot read replicas, offloading unnecessary tasks) and the tuning (connection pools, better indexes) that they made.
An article on the importance of testing and formal verification in distributed systems. The author also argues that functional programming and static typing can reduce the amount of testing needed and make formal verification easier.
Cloudera shares some benchmarking of the G1GC, ZGC, and CMS garbage collectors in Java 11 with HBase. They use the Yahoo Cloud Serving Benchmark to evaluate performance and to find improved settings for the HBase workload.
Zoomcar's data platform ingests data from a number of sources (mobility products as well as customer apps). They write about how the platform has evolved from analytics on a MySQL replica to a full-blown data platform with data in Kafka and S3. The post covers a lot of topics, such as how they ingest data from relational databases (plus schemas) and their clickstream.
WePay has open sourced Waltz, which is a distributed write-ahead log. They use Waltz as the primary store for transactions, and they materialize views of the data to the database for each service. Waltz has a lot of features for serializability, which are described (along with the architecture) in this post. Waltz uses ZooKeeper for cluster management, and it has separate server and storage nodes.
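To make the "log as the primary store, database as a materialized view" idea concrete, here is a minimal in-memory sketch. It is not Waltz's API: the class names, the record shape, and the `catch_up` method are all hypothetical, and it leaves out replication, cluster management, and the serializability features the post describes.

```python
class WriteAheadLog:
    """Toy append-only log (hypothetical; not the Waltz client API)."""

    def __init__(self):
        self._entries = []

    def append(self, record):
        # The log is the source of truth: records are only ever appended.
        self._entries.append(record)
        return len(self._entries) - 1  # offset of the new record

    def read_from(self, offset):
        # Return (offset, record) pairs starting at the given offset.
        return list(enumerate(self._entries))[offset:]


class BalanceView:
    """Hypothetical per-service view rebuilt by replaying the log."""

    def __init__(self):
        self.balances = {}
        self.next_offset = 0  # first log offset not yet applied

    def catch_up(self, log):
        # Apply every record the view has not seen yet, in log order.
        for offset, record in log.read_from(self.next_offset):
            account = record["account"]
            self.balances[account] = self.balances.get(account, 0) + record["delta"]
            self.next_offset = offset + 1


log = WriteAheadLog()
log.append({"account": "alice", "delta": 100})
log.append({"account": "bob", "delta": 40})
log.append({"account": "alice", "delta": -30})

view = BalanceView()
view.catch_up(log)
```

Because the view tracks the next offset to apply, calling `catch_up` again without new records changes nothing, and a fresh view can be rebuilt from offset 0 at any time.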
Gojek shares some tips for configuring and tuning the Kafka Producer.
Robinhood writes about the infrastructure powering their data lake, which processes over 10TB/day and houses over 4PB of data. They ingest data from Kafka, storing it in S3 for batch processing with Apache Spark, AWS Athena/Presto, and Redshift. Workflows are coordinated with Apache Airflow, and they use Looker for BI.
This post introduces SQL ROLLUP, which computes aggregates at multiple levels of a grouping (when your GROUP BY has multiple columns). It also looks at the CUBE keyword, which computes even more levels of aggregates.
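For readers who haven't used these keywords, the grouping sets they expand to can be emulated in a few lines of Python. This is only a sketch of the semantics, not how any database implements them; the `SALES` rows, the column names, and the helper functions are made up for illustration. Columns that are left out of a grouping set show up as `None`, much as SQL reports them as NULL.

```python
from collections import defaultdict
from itertools import combinations


def rollup_sets(keys):
    # ROLLUP(a, b) expands to the grouping sets (a, b), (a), ()
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]


def cube_sets(keys):
    # CUBE(a, b) expands to every subset: (a, b), (a), (b), ()
    return [c for n in range(len(keys), -1, -1) for c in combinations(keys, n)]


def aggregate(rows, keys, grouping_sets):
    """Sum `amount` per grouping set; rolled-up columns become None."""
    totals = defaultdict(int)
    for row in rows:
        for active in grouping_sets:
            group = tuple(row[k] if k in active else None for k in keys)
            totals[group] += row["amount"]
    return dict(totals)


# Made-up sample data for illustration.
SALES = [
    {"region": "east", "product": "bike", "amount": 10},
    {"region": "east", "product": "car", "amount": 20},
    {"region": "west", "product": "bike", "amount": 5},
]
KEYS = ("region", "product")

rollup_result = aggregate(SALES, KEYS, rollup_sets(KEYS))
cube_result = aggregate(SALES, KEYS, cube_sets(KEYS))
```

Here `rollup_result` holds the per-(region, product) sums, the per-region subtotals, and the grand total, while `cube_result` additionally holds the per-product subtotals.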
`fselect` is a handy CLI tool that presents a SQL-like query language for searching the file system (similar to *nix `find`). It also supports outputting results as JSON in addition to delimited text.
Curated by Datadog
Apache Kafka Data Durability (San Jose) - Thursday, September 19
Evolving Data Technologies: Survey of Data Technology Trends (Bellevue) - Wednesday, September 18
Real-Time Analytics with Apache Druid at Fullcontact (Denver) - Tuesday, September 17
Event-Driven Architecture with Kafka and Containers (Brookfield) - Wednesday, September 18
District of Columbia
Survey of Real-Time Data Platforms: Cassandra, Spark, Akka, Kafka, Etc. (Washington) - Thursday, September 19
Apache Kafka for the Enterprise: IBM Event Streams (Toronto) - Monday, September 16
Spark Meetup: Real-Time Edition (Dublin) - Thursday, September 19
Parquet Optimisations + Building Spark Data Pipelines (London) - Wednesday, September 18
Building Stream Processing Applications with Apache Kafka Using KSQL (Manchester) - Thursday, September 19
Helsinki Apache Kafka Meetup (Helsinki) - Monday, September 16
Kafka Streams and the Tide of Data (Barcelona) - Wednesday, September 18
FinistDevs: Apache Flink & WebAssembly (Le Relecq-Kerhuon) - Thursday, September 19
Building Stream Processing Applications with Apache Kafka Using KSQL (Dortmund) - Tuesday, September 17
On Track with Apache Kafka: Building a Streaming ETL Solution with Rail Data (Eschborn) - Wednesday, September 18
Orchestrate Kafka on Kubernetes + Kafka @DATEV (Nuremberg) - Wednesday, September 18
Building Stream Processing Applications with Apache Kafka Using KSQL (Rome) - Monday, September 16
Dissolving the Problem: Kafka Is More ACID Than Your Database (Gdansk) - Monday, September 16
Riding Endless Streams with Kafka (Sofia) - Thursday, September 19
Melbourne Data Engineering Meetup (Melbourne) - Thursday, September 19
Sydney Data Engineering Meetup (Surry Hills) - Thursday, September 19
NZ Data Engineering Meetup #1: Snowflake and Your Data Lake (Auckland) - Thursday, September 19
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.