Data Eng Weekly #333

Feb 03, 2020

Hey all! It's been a while, and hopefully everyone had a nice new year. I took the last ~6 weeks off, so there's quite a bit to catch up on. This week's issue has sixteen of the best articles from that time covering topics like Apache Kafka producers, distributed storage engines for Prometheus, Presto+Pinot, and a Jepsen analysis of etcd. Also, Yelp writes about their Kafka architecture, and Teads writes about optimizing Spark applications using User Defined Aggregate Functions. Lots to read up on, whether you're looking for some tips to apply to your own system, new tools to try out, or learning more about how systems work under the hood.

A look back at the themes that have emerged from the last decade in technology, including a number of distributed systems items like the return of SQL and streaming. The post also predicts big areas for the next few years, like the future of PaaS and Kubernetes (and also areas with no connection to distributed systems like retail, journalism, and social media).

https://medium.com/@copyconstruct/a-decade-in-review-in-tech-1cde76c9b43c

A look at the important configurations to improve throughput of your Kafka Producers, as well as the key metrics to monitor on the broker related to efficient producing.

https://www.jesseyates.com/2020/01/01/high-performance-kafka-producers.html

This post covers several solutions for running large-scale Prometheus deployments. These include Thanos, Cortex, M3DB, and VictoriaMetrics. There is a mix of architectures (some push and some pull), as well as various trade-offs for things like cold storage backends.

https://monitoring2.substack.com/p/big-prometheus

This post describes using SQLite to replace a networked database, which is quite an interesting idea. By sharing SQLite as a datastore across containers, they're able to improve latency by over 20x for a system that's processing 500k messages per second. Several details to dig into here, and some commentary on the larger impacts of a change like this (which pushes complexity out of a data plane and into the control plane).

https://medium.com/@rbranson/sharing-sqlite-databases-across-containers-is-surprisingly-brilliant-bacb8d753054

A good look at how the Java Virtual Machine manages heap space, and how the garbage collector interacts with various memory regions in order to reclaim space. The post describes the various types of garbage collection strategy, the types of events that trigger a garbage collection, and more.

https://sematext.com/blog/java-garbage-collection/

A thorough introduction to connecting to databases from Java, covering JDBC, Hibernate, the Java Persistence API, lightweight libraries for Java SQL, and more. The post has lots of code examples, advice for when certain libraries are appropriate, and more.

https://www.marcobehler.com/guides/java-databases-jdbc-hibernate-spring-data

Presto's 2019 year in review, covering new syntax (e.g. adding comments and fetching just the first N rows), query optimizations (e.g. improvements to the cost-based optimizer and lazy materialization), new connectors (Elasticsearch, Google Sheets), and much more. The post also looks at what's next for Presto in 2020.

https://prestosql.io/blog/2020/01/01/2019-summary.html

Uber writes about how they've integrated Pinot, their real-time analytics system, with Presto for SQL queries. The article describes the architecture, how they improved the connector's performance by pushing down predicates, limits, aggregates, and more, and how it performs in practice.

https://eng.uber.com/engineering-sql-support-on-apache-pinot/

A look at instrumenting apps written in C, Java, and Golang for analysis using eBPF. The post is part of a larger series on eBPF, and there's an introduction to the main concepts at the top of the article.

https://sematext.com/blog/ebpf-userland-apps/

Quarkus, the Java application framework, has a new extension to implement the outbox pattern (quite interesting to read about if you've missed earlier articles!) for change data capture. The Debezium blog has an introduction that walks through how to get started with Quarkus and the Outbox Quarkus Extension for generating events.

https://debezium.io/blog/2020/01/22/outbox-quarkus-extension/
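
For readers who missed the earlier articles, the core idea of the outbox pattern can be sketched in a few lines. This is a minimal illustration using Python's built-in `sqlite3`, not the Quarkus extension or Debezium itself; the table layout, column names, and the `place_order` helper are all hypothetical. The business row and the event row are written in one transaction, so a change data capture tool tailing the `outbox` table never sees an event for a change that did not commit.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one for business data, one acting as the outbox.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id INTEGER NOT NULL, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )


def place_order(conn, order_id, item, event_type="OrderCreated"):
    # `with conn` commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, event_type, json.dumps({"id": order_id, "item": item})),
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
place_order(conn, 1, "book")

try:
    # The outbox insert violates NOT NULL, so the order insert is rolled back too.
    place_order(conn, 2, "lamp", event_type=None)
except sqlite3.IntegrityError:
    pass

order_ids = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
events = conn.execute("SELECT aggregate_id, event_type FROM outbox").fetchall()
```

After running this, only order 1 and its event exist: the failed second call leaves neither an order row nor an outbox row behind.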

Yelp writes about their Kafka infrastructure, which has several components. The post describes two of them in detail: the Stream Discovery and Allocation service (which enforces schemas and defines a stream as either fire and forget or acked) and "Monk Leaf," a service that runs locally and proxies to Kafka (implementing either the acked or fire and forget semantics). These components provide a platform that makes it easy for developers to deploy applications and get data into the Kafka data pipeline.

https://engineeringblog.yelp.com/2020/01/streams-and-monk-how-yelp-approaches-kafka-in-2020.html

The third post in a series on Spark Job Optimization myths (the previous two looked at executor memory and number of executors), this post looks at why adding more memory to the Spark driver doesn't always improve performance. It has some tips that can improve the driver performance instead—like avoiding globals and avoiding expensive calls to functions like `collect()`. If you're looking at optimizing your Spark usage, this series is worth digging into.

https://www.davidmcginnis.net/post/spark-job-optimization-myth-3-i-need-more-driver-memory

A post from Teads describes both how to speed up a Spark application using a Spark User-Defined Aggregate Function (UDAF) and how to optimize Spark applications in general. Their article walks through how they sped up one of their applications from 28 mins to 9 mins. Several of the optimizations are informed by Spark execution DAGs, several of which they analyze in the post.

https://medium.com/teads-engineering/apache-spark-udaf-could-be-an-option-c2bc25298276
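
As a rough refresher on what a user-defined aggregate does, here is a plain-Python model of the usual shape: an empty buffer, a per-row update, a merge of partial buffers from different partitions, and a final step that turns the buffer into a result. This is not the Spark API and not the aggregate Teads wrote; `MeanAggregate`, `run`, and the method names are hypothetical, chosen only to show why partial aggregation per partition avoids moving every row across the cluster.

```python
from functools import reduce


class MeanAggregate:
    """Computes a mean; the buffer is a (sum, count) pair."""

    def zero(self):
        return (0.0, 0)

    def update(self, buf, value):
        # Fold one input row into a partition-local buffer.
        return (buf[0] + value, buf[1] + 1)

    def merge(self, a, b):
        # Combine two partial buffers produced on different partitions.
        return (a[0] + b[0], a[1] + b[1])

    def finish(self, buf):
        return buf[0] / buf[1] if buf[1] else None


def run(agg, partitions):
    # One small buffer per partition, then a single merge pass over the buffers.
    partials = [reduce(agg.update, part, agg.zero()) for part in partitions]
    return agg.finish(reduce(agg.merge, partials, agg.zero()))


partitions = [[1.0, 2.0], [3.0], [], [4.0, 5.0, 6.0]]
result = run(MeanAggregate(), partitions)
empty_result = run(MeanAggregate(), [[]])
```

Only the small `(sum, count)` buffers are combined across partitions, which is the property that makes this style of aggregation cheap to distribute.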

An overview of how the Apache Kafka idempotent producer works, including pointers to relevant pieces of the code that generate ids, complete batches, and more. The post details how the producer has been extended since the original implementation to support multiple in-flight requests.

https://www.waitingforcode.com/apache-kafka/apache-kafka-idempotent-producer/read
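
The basic mechanism behind idempotent producing can be modeled in a few lines: the producer tags each write with a producer id and a sequence number, and the partition accepts a write only if it carries the next expected sequence. The sketch below is a deliberately simplified model, not Kafka's code; `PartitionLog` and its return values are hypothetical, and it tracks one sequence number per write rather than per record batch, ignoring the bookkeeping needed for multiple in-flight requests.

```python
class PartitionLog:
    """Toy model of one partition that deduplicates producer retries."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id, seq, record):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            # Already written: this is a retry of an earlier send.
            return "duplicate"
        if seq != last + 1:
            # A gap means an earlier write is missing, so reject this one.
            return "out_of_order"
        self.records.append(record)
        self.last_seq[producer_id] = seq
        return "appended"


log = PartitionLog()
results = [
    log.append(7, 0, "a"),
    log.append(7, 1, "b"),
    log.append(7, 1, "b"),  # retry after a lost acknowledgement
    log.append(7, 3, "d"),  # gap: sequence 2 never arrived
]
```

Here the retried write is reported as a duplicate and the write with a gap is rejected, so the log ends up holding only `"a"` and `"b"`.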

Jepsen has a post on etcd, which is the key-value store used by Kubernetes and other distributed systems. The article describes several tests of correctness, which verified strict serializable operations and correct delivery of watches. They found some issues with locks, which the etcd team is addressing (more details in a companion blog post). As always, it's great to read about verification of distributed systems—the post brings together practice and theory in a way that builds extra context on both.

https://jepsen.io/analyses/etcd-3.4.3