Data Eng Weekly #327
This week's issue covers quite the range of topics—Netflix's change data capture architecture, optimizing cloud costs at Segment, the Apache Arrow Flight protocol, Kubernetes operators/controllers, and Python concurrency. Also, a look at two new projects—MLflow Model Registry and DuckDB, an embedded columnar database engine.
The Apache Arrow blog writes about the new Arrow Flight protocol for fast, efficient data transfer (it sends data in Arrow format). The post goes into the motivation for Flight, describes some of the basics of a Flight server, explains how Flight builds on gRPC, and more. While it's still fairly early in the development process, Flight could prove to be really important for improving the efficiency of large-scale data processing.
A look at the `pg_prewarm` extension for prewarming the PostgreSQL cache, including how to enable it to run automatically on server startup.
Netflix writes about Delta, their system for shuffling data between systems using change data capture (CDC). They've built Delta connectors for MySQL and Postgres that stream data to Apache Kafka. The post discusses their Kafka configurations and the stream processing framework (built on Apache Flink) that processes the CDC data and enriches it to build denormalized records.
The MLflow Model Registry is a new extension to the MLflow project that provides an API and Web UI for uploading and promoting machine learning models across environments. It has first-class notions of environments/lifecycle stages (e.g. to promote from staging to production), which makes it a good match for CI/CD tooling.
To speed up your Python scripts, you can use multithreading or multiprocessing. This post shows how, if you write your code in a functional way, you can introduce parallelism with only a few changes. It demonstrates the ThreadPoolExecutor, the ProcessPoolExecutor, and the tradeoffs between the two.
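As a rough illustration of the idea (not the code from the linked post), here is a minimal sketch using only the standard library. The function `word_count` and the helpers `run_serial` and `run_parallel` are hypothetical names chosen for this example. Because `word_count` is a pure function, swapping the built-in `map` for an executor's `map` is the only change needed.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def word_count(text: str) -> int:
    """Pure function with no shared state, so calls can run in any worker."""
    return len(text.split())


def run_serial(items):
    # Baseline: the built-in map, one item at a time.
    return list(map(word_count, items))


def run_parallel(items, executor_cls=ThreadPoolExecutor, max_workers=4):
    # Executor.map mirrors the built-in map and returns results in input order.
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(word_count, items))


if __name__ == "__main__":
    docs = ["a b c", "hello world", ""]
    print(run_serial(docs))
    print(run_parallel(docs))                       # threads
    print(run_parallel(docs, ProcessPoolExecutor))  # processes
```

As a general rule of thumb, threads tend to suit I/O-bound work (network calls, disk reads), while processes sidestep the global interpreter lock for CPU-bound work at the cost of pickling arguments and results. The `__main__` guard matters for the process pool on platforms that start workers by re-importing the script.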
In Kubernetes, operators and controllers are common for stateful systems and others that deal with data. Even if you're not building a Kubernetes controller yourself, this post describing the differences between the two is a good introduction.
DuckDB is a new embedded, columnar database optimized for analytics workloads. This post shows how to use it via Python bindings, and it compares performance with SQLite on a few queries.
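For a feel of what such a comparison can look like, here is a small sketch, not the benchmark from the linked post. It assumes the third-party `duckdb` package is installed (`pip install duckdb`) and falls back to SQLite alone if it is not. The table `t`, the synthetic rows, and the helper `load_and_time` are all made up for illustration, and a table this small says little about real-world performance.

```python
import sqlite3
import time

try:
    import duckdb  # third-party package; assumed installed for the comparison
except ImportError:
    duckdb = None

# Synthetic data: 10,000 rows spread over 10 groups (illustrative only).
ROWS = [(i % 10, i) for i in range(10_000)]
QUERY = "SELECT grp, SUM(val) FROM t GROUP BY grp ORDER BY grp"


def load_and_time(con):
    """Create and fill table t, then time one aggregate query on `con`."""
    con.execute("CREATE TABLE t (grp INTEGER, val INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    start = time.perf_counter()
    result = con.execute(QUERY).fetchall()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    engines = {"sqlite": sqlite3.connect(":memory:")}
    if duckdb is not None:
        engines["duckdb"] = duckdb.connect()  # in-memory database
    for name, con in engines.items():
        result, elapsed = load_and_time(con)
        print(f"{name}: {len(result)} groups in {elapsed * 1000:.2f} ms")
```

Both engines run in-process and accept the same SQL here, so the same helper works for either connection. Only the query is timed, not the row-by-row load.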
Curated by Datadog ( http://www.datadog.com )
Kafka on Kubernetes: Just Because You Can, Doesn't Mean You Should! (New York) - Tuesday, October 22
Free Apache Kafka Workshop (Boston) - Tuesday, October 22
10th Data Engineering Meetup (Belo Horizonte) - Wednesday, October 23
Kuberoo (London) - Thursday, October 24
Trustly Duchess Meetup: Introduction to Apache Kafka and Reactive Java (Stockholm) - Wednesday, October 23
Design Principles for an Event-Driven Architecture/Streaming with KSQL (Las Rozas de Madrid) - Thursday, October 24
Extending Spark for Qbeast's SQL DataSource (Barcelona) - Thursday, October 24
Data Engineering with Delta Lake, Pulsar, and Spark-Tools (Paris) - Tuesday, October 22
Full Day Apache Cassandra & Kafka Workshop (Berlin) - Monday, October 21
FREE NOW Data Journey to Kafka (Hamburg) - Tuesday, October 22
Cassandra Meets Kafka at ApacheCon! (Berlin) - Wednesday, October 23
Apache Kylin Meetup @ OLX (Berlin) - Thursday, October 24
Rg-Dev #32 (Rzeszow) - Thursday, October 24
Sydney Data Engineering Meetup (Surry Hills) - Thursday, October 24
Fintech Production with Kafka Streams (Melbourne) - Thursday, October 24
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.