Data Eng Weekly #327
This week's issue covers quite the range of topics—Netflix's change data capture architecture, optimizing cloud costs at Segment, the Apache Arrow Flight protocol, Kubernetes operators/controllers, and Python concurrency. Also, a look at two new projects—MLflow Model Registry and DuckDB, an embedded columnar database engine.
The Apache Arrow blog writes about the new Arrow Flight protocol for fast, efficient data transfer (by sending data in Arrow format). The post goes into the motivation for Flight, describes some of the basics of a Flight server, explains how Flight builds on gRPC, and more. While it's still fairly early in the development process, Flight could prove to be really important for improving the efficiency of large-scale data processing.
http://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
A look at the `pg_prewarm` extension for prewarming the PostgreSQL cache, including how to enable it to run automatically on server startup.
https://www.cybertec-postgresql.com/en/prewarming-postgresql-i-o-caches/
Netflix writes about Delta, their system for shuffling data between systems using change data capture (CDC). They've built Delta connectors for MySQL and Postgres that stream data to Apache Kafka. The post discusses their Kafka configurations and the stream processing framework (built on Apache Flink) that processes the CDC data and enriches it to build denormalized records.
The MLflow Model Registry is a new extension to the MLflow project that provides an API and Web UI for uploading and promoting machine learning models across environments. It has first-class notions of environments/lifecycle stages (e.g. to promote from staging to production), which makes it a good match for CI/CD tooling.
https://databricks.com/blog/2019/10/17/introducing-the-mlflow-model-registry.html
To speed up your Python scripts, you can use multithreading or multiprocessing. This post shows how, if you write your code in a functional way, you can introduce parallelism with only a few changes. It demonstrates the ThreadPoolExecutor, the ProcessPoolExecutor, and the tradeoffs between the two.
http://pljung.de/posts/easy-concurrency-in-python/
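As a rough illustration of the idea (not code from the linked post), here is a minimal sketch using the standard library's `concurrent.futures`. The `word_count` and `count_all` names and the sample inputs are made up for this example. Because `word_count` is a pure function with no shared state, swapping the built-in `map` for `executor.map` is the only change needed to run it in parallel.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def word_count(text: str) -> int:
    """Pure function (hypothetical example): depends only on its input."""
    return len(text.split())


def count_all(texts, executor_cls=None, max_workers=4):
    """Apply word_count to every text, optionally through an executor.

    executor_cls=None runs sequentially with the built-in map; passing
    ThreadPoolExecutor or ProcessPoolExecutor runs the same function in
    parallel. Results come back in input order in all three cases.
    """
    if executor_cls is None:
        return list(map(word_count, texts))
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(word_count, texts))


if __name__ == "__main__":
    docs = ["a b c", "hello world", "", "one"]
    print(count_all(docs))                       # sequential
    print(count_all(docs, ThreadPoolExecutor))   # threads
    print(count_all(docs, ProcessPoolExecutor))  # processes
```

As a general rule of thumb, threads tend to suit I/O-bound work (network calls, disk reads), while processes sidestep the GIL for CPU-bound work at the cost of pickling arguments and results. The `__main__` guard matters for the process pool on platforms that spawn new interpreters.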
In Kubernetes, operators and controllers are pretty common for stateful systems or those otherwise dealing with data. Even if you're not building a Kubernetes controller yourself, this post that describes the differences between the two is a good introduction.
https://octetz.com/posts/k8s-controllers-vs-operators
Segment describes several optimizations that they made to reduce their infrastructure costs. The changes span all parts of the stack, from data systems to the JavaScript file that they serve to customers. On the data front, they describe changes to their deployments of Apache Kafka (switching to instances with local storage) and NSQ (moving from a colocated model to a centralized cluster). They also minimized cross-AZ transfer costs by altering Kafka clients and service discovery to keep traffic inside a single zone.
https://segment.com/blog/the-10m-engineering-problem/
DuckDB is a new embedded, columnar database optimized for analytics workloads. This post shows how to use it via Python bindings, and it compares performance with SQLite on a few queries.
https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html
Events
Curated by Datadog ( http://www.datadog.com )
New York
Kafka on Kubernetes: Just Because You Can, Doesn't Mean You Should! (New York) - Tuesday, October 22
https://www.meetup.com/NYC-Open-Data/events/263390404/
Massachusetts
Free Apache Kafka Workshop (Boston) - Tuesday, October 22
https://www.meetup.com/aittg-boston/events/264304883/
BRAZIL
10th Data Engineering Meetup (Belo Horizonte) - Wednesday, October 23
https://www.meetup.com/engenharia-de-dados/events/265772897/
UNITED KINGDOM
Kuberoo (London) - Thursday, October 24
https://www.meetup.com/Kubernetes-London/events/265617529/
SWEDEN
Trustly Duchess Meetup: Introduction to Apache Kafka and Reactive Java (Stockholm) - Wednesday, October 23
https://www.meetup.com/Duchess-Sweden/events/265555150/
SPAIN
Design Principles for an Event-Driven Architecture/Streaming with KSQL (Las Rozas de Madrid) - Thursday, October 24
https://www.meetup.com/Madrid-Kafka/events/265321681/
Extending Spark for Qbeast's SQL DataSource (Barcelona) - Thursday, October 24
https://www.meetup.com/Spark-Barcelona/events/265706465/
FRANCE
Data Engineering with Delta Lake, Pulsar, and Spark-Tools (Paris) - Tuesday, October 22
https://www.meetup.com/Paris-Data-Engineers/events/264819837/
GERMANY
Full Day Apache Cassandra & Kafka Workshop (Berlin) - Monday, October 21
https://www.meetup.com/Distributed-Data-Berlin/events/264890586/
FREE NOW Data Journey to Kafka (Hamburg) - Tuesday, October 22
https://www.meetup.com/Hamburg-Kafka/events/265207803/
Cassandra Meets Kafka at ApacheCon! (Berlin) - Wednesday, October 23
https://www.meetup.com/Berlin-Cassandra-Users/events/265707785/
Apache Kylin Meetup @ OLX (Berlin) - Thursday, October 24
https://www.meetup.com/Apache-Kylin-Meetup-Berlin/events/264945114/
POLAND
Rg-Dev #32 (Rzeszow) - Thursday, October 24
https://www.meetup.com/rg-dev/events/262422311/
AUSTRALIA
Sydney Data Engineering Meetup (Surry Hills) - Thursday, October 24
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/262769526/
Fintech Production with Kafka Streams (Melbourne) - Thursday, October 24
https://www.meetup.com/melbourne-distributed/events/265013568/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.