Data Eng Weekly #327

This week's issue covers quite the range of topics—Netflix's change data capture architecture, optimizing cloud costs at Segment, the Apache Arrow Flight protocol, Kubernetes operators/controllers, and Python concurrency. Also, a look at two new projects—MLflow Model Registry and DuckDB, an embedded columnar database engine.

The Apache Arrow blog writes about the new Arrow Flight protocol, which moves data quickly and efficiently by sending it in Arrow format. The post goes into the motivation for Flight, describes some of the basics of a Flight server, explains how Flight builds on gRPC, and more. While it's still fairly early in the development process, Flight could prove to be really important for improving the efficiency of large-scale data processing.

A look at the `pg_prewarm` extension for prewarming the PostgreSQL cache, including how to enable it to run automatically on server startup.

Netflix writes about Delta, their system for shuffling data between systems using change data capture (CDC). They've built Delta connectors for MySQL and Postgres that stream data to Apache Kafka. The post discusses their Kafka configurations and the stream processing framework (built on Apache Flink) that processes the CDC data and enriches it to build denormalized records.

The MLflow Model Registry is a new extension to the MLflow project that provides an API and Web UI for uploading and promoting machine learning models across environments. It has first-class notions of environments/lifecycle stages (e.g. to promote from staging to production), which makes it a good match for CI/CD tooling.

To speed up your Python scripts, you can use multithreading or multiprocessing. This post shows how, if you write your code in a functional way, you can introduce parallelism with only a few changes. It demonstrates the ThreadPoolExecutor, ProcessPoolExecutor, and the tradeoffs between the two.
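
As a rough illustration of the idea (not code from the linked post), here is a minimal sketch using the standard library's `concurrent.futures`. The `word_count` function and the sample inputs are hypothetical; the point is that a pure function applied with `map` can be handed to an executor with very few changes.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def word_count(text):
    """Hypothetical pure function: no shared state, so calls can run in parallel."""
    return len(text.split())


def count_serial(texts):
    # Baseline: plain functional style with the built-in map.
    return list(map(word_count, texts))


def count_parallel(texts, executor_cls=ThreadPoolExecutor, max_workers=4):
    # Same shape as count_serial, but the executor's map distributes the calls.
    # Executor.map returns results in input order.
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(word_count, texts))


if __name__ == "__main__":
    docs = ["change data capture", "embedded columnar database", "arrow flight"]
    print(count_serial(docs))
    print(count_parallel(docs, ThreadPoolExecutor))
    # Processes need picklable, module-level functions and the __main__ guard.
    print(count_parallel(docs, ProcessPoolExecutor))
```

As a general rule of thumb, threads tend to suit I/O-bound work (network calls, file reads), while processes sidestep the global interpreter lock for CPU-bound work at the cost of pickling arguments and results.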

In Kubernetes, operators and controllers are pretty common for stateful systems or those otherwise dealing with data. Even if you're not building a Kubernetes controller yourself, this post describing the differences between the two is a good introduction.

Segment describes several optimizations that they made to reduce their infrastructure costs. The changes span the whole stack, from data systems to the JavaScript file that they serve to customers. On the data front, they describe changes to their deployments of Apache Kafka (switching to instances with local storage) and NSQ (moving from a colocated model to a centralized cluster). They also minimized cross-AZ transfer costs by altering Kafka clients and service discovery to keep traffic inside a single zone.

DuckDB is a new embedded, columnar database optimized for analytics workloads. This post shows how to use it via Python bindings, and it compares performance with SQLite on a few queries.
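
For a feel of what such a comparison might look like, here is a minimal sketch (not the benchmark from the linked post). The `trips` table, its rows, and the query are made up for illustration. The SQLite half uses only the standard library; the DuckDB half assumes the third-party `duckdb` package is installed and is skipped otherwise. With three rows the timings are meaningless, so treat this as a harness shape rather than a result.

```python
import sqlite3
import time

try:
    import duckdb  # third-party package; assumed installed via `pip install duckdb`
except ImportError:
    duckdb = None

# Hypothetical schema, data, and query, chosen to run unchanged on both engines.
SETUP = "CREATE TABLE trips (city TEXT, fare REAL)"
ROWS = [("nyc", 12.5), ("nyc", 7.5), ("sf", 20.0)]
QUERY = "SELECT city, SUM(fare) FROM trips GROUP BY city ORDER BY city"


def timed_query(con):
    """Load the sample rows, then time only the aggregate query."""
    con.execute(SETUP)
    con.executemany("INSERT INTO trips VALUES (?, ?)", ROWS)
    start = time.perf_counter()
    result = con.execute(QUERY).fetchall()
    elapsed = time.perf_counter() - start
    con.close()
    return result, elapsed


def run_sqlite():
    return timed_query(sqlite3.connect(":memory:"))


def run_duckdb():
    return timed_query(duckdb.connect(":memory:"))


if __name__ == "__main__":
    print("sqlite:", run_sqlite())
    if duckdb is not None:
        print("duckdb:", run_duckdb())
```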


Curated by Datadog

New York

Kafka on Kubernetes: Just Because You Can, Doesn't Mean You Should! (New York) - Tuesday, October 22


Free Apache Kafka Workshop (Boston) - Tuesday, October 22


10th Data Engineering Meetup (Belo Horizonte) - Wednesday, October 23


Kuberoo (London) - Thursday, October 24


Trustly Duchess Meetup: Introduction to Apache Kafka and Reactive Java (Stockholm) - Wednesday, October 23


Design Principles for an Event-Driven Architecture/Streaming with KSQL (Las Rozas de Madrid) - Thursday, October 24

Extending Spark for Qbeast's SQL DataSource (Barcelona) - Thursday, October 24


Data Engineering with Delta Lake, Pulsar, and Spark-Tools (Paris) - Tuesday, October 22


Full Day Apache Cassandra & Kafka Workshop (Berlin) - Monday, October 21

FREE NOW Data Journey to Kafka (Hamburg) - Tuesday, October 22

Cassandra Meets Kafka at ApacheCon! (Berlin) - Wednesday, October 23

Apache Kylin Meetup @ OLX (Berlin) - Thursday, October 24


Rg-Dev #32 (Rzeszow) - Thursday, October 24


Sydney Data Engineering Meetup (Surry Hills) - Thursday, October 24

Fintech Production with Kafka Streams (Melbourne) - Thursday, October 24

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.