Data Eng Weekly #327

Oct 21, 2019

This week's issue covers quite the range of topics—Netflix's change data capture architecture, optimizing cloud costs at Segment, the Apache Arrow Flight protocol, Kubernetes operators/controllers, and Python concurrency. Also, a look at two new projects—MLflow Model Registry and DuckDB, an embedded columnar database engine.

The Apache Arrow blog writes about the new Arrow Flight protocol, which transfers data quickly and efficiently by sending it in Arrow format. The post covers the motivation for Flight, the basics of a Flight server, how Flight builds on gRPC, and more. While it's still fairly early in the development process, Flight could prove important for improving the efficiency of large-scale data processing.

http://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

A look at the `pg_prewarm` extension for prewarming the PostgreSQL cache, including how to enable it to run automatically on server startup.

https://www.cybertec-postgresql.com/en/prewarming-postgresql-i-o-caches/

Netflix writes about Delta, their system for shuffling data between systems using change data capture (CDC). They've built Delta connectors for MySQL and Postgres that stream data to Apache Kafka. The post discusses their Kafka configurations and the stream processing framework (built on Apache Flink) that processes the CDC data and enriches it to build denormalized records.

https://medium.com/netflix-techblog/delta-a-data-synchronization-and-enrichment-platform-e82c36a79aee

The MLflow Model Registry is a new extension to the MLflow project that provides an API and Web UI for uploading and promoting machine learning models across environments. It has first-class notions of environments/lifecycle stages (e.g. to promote from staging to production), which makes it a good match for CI/CD tooling.

https://databricks.com/blog/2019/10/17/introducing-the-mlflow-model-registry.html

To speed up your Python scripts, you can use multithreading or multiprocessing. This post shows how, if you write your code in a functional way, you can introduce parallelism with only a few changes. It demonstrates the ThreadPoolExecutor, the ProcessPoolExecutor, and the tradeoffs between the two.

http://pljung.de/posts/easy-concurrency-in-python/
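
As a rough illustration of the idea (not taken from the linked post), here is a minimal sketch using only the standard library. The names `word_count` and `count_all` are hypothetical and exist only for this example. Because `word_count` is a pure function, the built-in `map` can be swapped for an executor's `map` without touching the rest of the code.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Iterable, List, Optional, Type


def word_count(text: str) -> int:
    """Pure function (hypothetical example): no shared state, so calls are independent."""
    return len(text.split())


def count_all(
    texts: Iterable[str],
    executor_cls: Optional[Type[Executor]] = None,
    max_workers: int = 4,
) -> List[int]:
    """Apply word_count to every text, sequentially or through an executor.

    Executor.map yields results in input order, so all variants return
    the same list.
    """
    if executor_cls is None:
        # Sequential baseline: the plain built-in map.
        return list(map(word_count, texts))
    # The only change needed for parallelism: map through a pool instead.
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(word_count, texts))
```

Calling `count_all(docs, ThreadPoolExecutor)` or `count_all(docs, ProcessPoolExecutor)` returns the same list as `count_all(docs)`. In CPython, threads generally suit I/O-bound work because the global interpreter lock limits CPU-bound speedups, while processes sidestep that limit at the cost of pickling the function and its arguments. When using `ProcessPoolExecutor` from a script, keep the call under an `if __name__ == "__main__":` guard.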

In Kubernetes, operators and controllers are pretty common for stateful systems or those otherwise dealing with data. Even if you're not building a Kubernetes controller yourself, this post, which describes the differences between the two, is a good introduction.

https://octetz.com/posts/k8s-controllers-vs-operators

Segment describes several optimizations that they made to reduce their infrastructure costs. The changes span all parts of the stack, from data systems to the JavaScript file that they serve to customers. On the data front, they describe changes to their deployments of Apache Kafka (switching to instances with local storage) and NSQ (moving from a colocated model to a centralized cluster). They also minimized cross-AZ transfer costs by altering Kafka clients and service discovery to keep traffic inside a single zone.

https://segment.com/blog/the-10m-engineering-problem/

DuckDB is a new embedded, columnar database optimized for analytics workloads. This post shows how to use it via Python bindings, and it compares performance with SQLite on a few queries.

https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html

Events

Curated by Datadog ( http://www.datadog.com )

NEW YORK

Kafka on Kubernetes: Just Because You Can, Doesn't Mean You Should! (New York) - Tuesday, October 22

https://www.meetup.com/NYC-Open-Data/events/263390404/

MASSACHUSETTS

Free Apache Kafka Workshop (Boston) - Tuesday, October 22

https://www.meetup.com/aittg-boston/events/264304883/

BRAZIL

10th Data Engineering Meetup (Belo Horizonte) - Wednesday, October 23

https://www.meetup.com/engenharia-de-dados/events/265772897/

UNITED KINGDOM

Kuberoo (London) - Thursday, October 24

https://www.meetup.com/Kubernetes-London/events/265617529/

SWEDEN

Trustly Duchess Meetup: Introduction to Apache Kafka and Reactive Java (Stockholm) - Wednesday, October 23

https://www.meetup.com/Duchess-Sweden/events/265555150/

SPAIN

Design Principles for an Event-Driven Architecture/Streaming with KSQL (Las Rozas de Madrid) - Thursday, October 24

https://www.meetup.com/Madrid-Kafka/events/265321681/

Extending Spark for Qbeast's SQL DataSource (Barcelona) - Thursday, October 24

https://www.meetup.com/Spark-Barcelona/events/265706465/

FRANCE

Data Engineering with Delta Lake, Pulsar, and Spark-Tools (Paris) - Tuesday, October 22

https://www.meetup.com/Paris-Data-Engineers/events/264819837/

GERMANY

Full Day Apache Cassandra & Kafka Workshop (Berlin) - Monday, October 21

https://www.meetup.com/Distributed-Data-Berlin/events/264890586/

FREE NOW Data Journey to Kafka (Hamburg) - Tuesday, October 22

https://www.meetup.com/Hamburg-Kafka/events/265207803/

Cassandra Meets Kafka at ApacheCon! (Berlin) - Wednesday, October 23

https://www.meetup.com/Berlin-Cassandra-Users/events/265707785/

Apache Kylin Meetup @ OLX (Berlin) - Thursday, October 24

https://www.meetup.com/Apache-Kylin-Meetup-Berlin/events/264945114/

POLAND

Rg-Dev #32 (Rzeszow) - Thursday, October 24

https://www.meetup.com/rg-dev/events/262422311/

AUSTRALIA

Sydney Data Engineering Meetup (Surry Hills) - Thursday, October 24

https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/262769526/

Fintech Production with Kafka Streams (Melbourne) - Thursday, October 24

https://www.meetup.com/melbourne-distributed/events/265013568/

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly