Data Eng Weekly #324

Sep 16, 2019

In this week's issue, Robinhood and Zoomcar share their data infrastructure, and we learn about WePay's distributed write-ahead log (newly open sourced). There are also great articles on database tuning, the new garbage collectors in Java 11, testing distributed systems, and more.

Squarespace writes about how they drastically improved the performance of the PostgreSQL deployment backing their TLS infrastructure (p95 response time went from 200ms to 50ms). The post covers the architectural changes (making better use of hot read replicas, offloading unnecessary tasks) and the tuning (connection pools, better indexes) that they made.

https://engineering.squarespace.com/blog/2019/performance-tuning-postgres-within-our-tls-infrastructure

An article on the importance of testing and formal verification in distributed systems. The author also argues that functional programming and static typing can reduce the amount of testing required and make formal verification easier.

https://blog.colinbreck.com/on-eliminating-error-in-distributed-software-systems/

Cloudera shares benchmarks of the G1GC, ZGC, and CMS garbage collectors in Java 11 with HBase. They use the Yahoo Cloud Serving Benchmark to evaluate performance and to arrive at improved settings for the HBase workload.

https://blog.cloudera.com/cdh6-3-hbase-g1-gc-tuning-with-jdk11/

Zoomcar's data platform ingests data from a number of sources (mobility products as well as customer apps). They write about how the platform has evolved from analytics on a MySQL replica to a full-blown data platform with data in Kafka and S3. The post covers a lot of topics, such as how they ingest data from relational databases (plus schemas) and their clickstream.

https://medium.com/@shanker.sneh/https-medium-com-shanker-sneh-data-platform-at-zoomcar-a-narrative-part-i-f2455e3e2ae5

WePay has open sourced Waltz, which is a distributed write-ahead log. They use Waltz as the primary store for transactions, and they materialize views of the data to the database for each service. Waltz has a lot of features for serializability, which are described (along with the architecture) in this post. Waltz uses ZooKeeper for cluster management, and it has separate server and storage nodes.

https://wecode.wepay.com/posts/waltz-a-distributed-write-ahead-log

Gojek shares some tips for configuring and tuning the Kafka Producer.

https://blog.gojekengineering.com/how-to-unlock-the-full-potential-of-kafka-producers-e1a6877e2167

Robinhood writes about the infrastructure powering their data lake, which processes over 10TB/day and houses over 4PB of data. They ingest data from Kafka, storing it in S3 for batch processing with Apache Spark, AWS Athena/Presto, and Redshift. Workflows are coordinated with Apache Airflow, and they use Looker for BI.

https://robinhood.engineering/data-lake-at-robinhood-3e9cdf963368

This post provides an introduction to SQL ROLLUP, which provides a mechanism to compute aggregates at multiple levels of a grouping (when your GROUP BY has multiple columns). It also looks at the CUBE keyword, which provides a mechanism for computing even more levels of aggregates.

https://dev.to/griffinator76/rollup-like-a-boss-3dkl
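
To make the difference between the two keywords concrete, here is a minimal Python sketch (not taken from the linked post) that emulates what ROLLUP and CUBE compute for a two-column GROUP BY. The `SALES` rows and the helper names `rollup_sets`, `cube_sets`, and `aggregate` are hypothetical and exist only for illustration. ROLLUP aggregates over each prefix of the grouping columns, while CUBE aggregates over every subset of them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows: (region, product, amount)
SALES = [
    ("east", "apples", 10),
    ("east", "pears", 5),
    ("west", "apples", 7),
    ("west", "pears", 3),
]

def rollup_sets(n):
    """Grouping sets for ROLLUP over n columns: each prefix, longest first."""
    return [tuple(range(k)) for k in range(n, -1, -1)]

def cube_sets(n):
    """Grouping sets for CUBE over n columns: every subset of the columns."""
    return [c for k in range(n, -1, -1) for c in combinations(range(n), k)]

def aggregate(rows, n_keys, grouping_sets):
    """Sum the last field of each row once per grouping set.

    Columns left out of a grouping set are reported as None,
    mirroring the NULLs a SQL engine emits for subtotal rows.
    """
    totals = defaultdict(int)
    for row in rows:
        keys, amount = row[:n_keys], row[-1]
        for grouping_set in grouping_sets:
            label = tuple(
                keys[i] if i in grouping_set else None for i in range(n_keys)
            )
            totals[label] += amount
    return dict(totals)

rollup = aggregate(SALES, 2, rollup_sets(2))  # like GROUP BY ROLLUP (region, product)
cube = aggregate(SALES, 2, cube_sets(2))      # like GROUP BY CUBE (region, product)

for label, total in sorted(cube.items(), key=str):
    print(label, total)
```

With these rows, the ROLLUP result holds the four detail rows, a subtotal per region, and a grand total. The CUBE result additionally holds a subtotal per product, which ROLLUP skips because `product` alone is not a prefix of `(region, product)`.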

`fselect` is a handy CLI tool that presents a SQL-like query language for searching the file system (similar to *nix `find`). It also supports outputting results as JSON in addition to delimited text.

https://cli.fan/posts/fselect/

Events

Curated by Datadog

California

Apache Kafka Data Durability (San Jose) - Thursday, September 19

https://www.meetup.com/BayLISA/events/264201177/

Washington

Evolving Data Technologies: Survey of Data Technology Trends (Bellevue) - Wednesday, September 18

https://www.meetup.com/Big-Data-Bellevue-BDB/events/262650432/

Colorado

Real-Time Analytics with Apache Druid at Fullcontact (Denver) - Tuesday, September 17

https://www.meetup.com/Denver-Apache-Druid-Meetup-by-Imply/events/264007236/

Wisconsin

Event-Driven Architecture with Kafka and Containers (Brookfield) - Wednesday, September 18

https://www.meetup.com/Techmke/events/264255810/

District of Columbia

Survey of Real-Time Data Platforms: Cassandra, Spark, Akka, Kafka, Etc. (Washington) - Thursday, September 19

https://www.meetup.com/Cassandra-DataStax-DC/events/264344711/

CANADA

Apache Kafka for the Enterprise: IBM Event Streams (Toronto) - Monday, September 16

https://www.meetup.com/IBM-Cloud-Toronto/events/264134916/

IRELAND

Spark Meetup: Real-Time Edition (Dublin) - Thursday, September 19

https://www.meetup.com/Dublin-Spark-Meetup/events/264285037/

UNITED KINGDOM

Parquet Optimisations + Building Spark Data Pipelines (London) - Wednesday, September 18

https://www.meetup.com/Spark-London/events/264184808/

Building Stream Processing Applications with Apache Kafka Using KSQL (Manchester) - Thursday, September 19

https://www.meetup.com/Manchester-Kafka/events/263968143/

FINLAND

Helsinki Apache Kafka Meetup (Helsinki) - Monday, September 16

https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/263025896/

SPAIN

Kafka Streams and the Tide of Data (Barcelona) - Wednesday, September 18

https://www.meetup.com/Meetup-de-Big-Data-de-datahack-en-Barcelona/events/264490794/

FRANCE

FinistDevs: Apache Flink & WebAssembly (Le Relecq-Kerhuon) - Thursday, September 19

https://www.meetup.com/FinistDevs/events/264598116/

GERMANY

Building Stream Processing Applications with Apache Kafka Using KSQL (Dortmund) - Tuesday, September 17

https://www.meetup.com/Dortmund-Kafka/events/263555199/

On Track with Apache Kafka: Building a Streaming ETL Solution with Rail Data (Eschborn) - Wednesday, September 18

https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/263895041/

Orchestrate Kafka on Kubernetes + Kafka @DATEV (Nuremberg) - Wednesday, September 18

https://www.meetup.com/Nurnberg-Kafka/events/264136545/

ITALY

Building Stream Processing Applications with Apache Kafka Using KSQL (Rome) - Monday, September 16

https://www.meetup.com/Roma-Kafka-meetup-group/events/263968301/

POLAND

Dissolving the Problem: Kafka Is More ACID Than Your Database (Gdansk) - Monday, September 16

https://www.meetup.com/Gdansk-Kafka/events/264429138/

BULGARIA

Riding Endless Streams with Kafka (Sofia) - Thursday, September 19

https://www.meetup.com/Leanplum-Tech-Talks-Sofia/events/264368211/

AUSTRALIA

Melbourne Data Engineering Meetup (Melbourne) - Thursday, September 19

https://www.meetup.com/Melbourne-Data-Engineering-Meetup/events/262799936/

Sydney Data Engineering Meetup (Surry Hills) - Thursday, September 19

https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/262330474/

NEW ZEALAND

NZ Data Engineering Meetup #1: Snowflake and Your Data Lake (Auckland) - Thursday, September 19

https://www.meetup.com/New-Zealand-Data-Engineering-Meetup/events/263637937/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.
