Data Eng Weekly #328

Oct 28, 2019

Short and sweet issue this week, with several new open source tools (Beekeeper for cleaning up unused data, the Mantis project for real-time operations, and pg_flame's flame graphs for analyzing Postgres queries) as well as implementation articles covering Apache Airflow, Rust for Kafka, and using bloom filters to optimize GDPR data deletion.

Beekeeper is a new open-source tool from Expedia for cleaning up unreferenced data in Apache Hive tables. It listens for metastore changes (like changes to table or partition locations) and periodically deletes abandoned data in S3. The GitHub repo has instructions for trying it out using Docker.

https://medium.com/expedia-group-tech/introducing-beekeeper-35ab8b770f23
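
To make the listen-then-clean-up idea concrete, here is a minimal sketch of the pattern in Python. It is not Beekeeper's code or API: the `CleanupScheduler` class, its method names, and the grace-period delay are all assumptions made for illustration, and the `delete` callback stands in for whatever would actually remove the data from S3.

```python
class CleanupScheduler:
    """Hypothetical illustration: track abandoned paths and delete them later."""

    def __init__(self, delay_seconds):
        # Assumed grace period before an abandoned path may be deleted.
        self.delay_seconds = delay_seconds
        self.pending = {}  # abandoned path -> earliest allowed deletion time

    def on_location_changed(self, old_path, new_path, now):
        """Record a metastore event that moved a table or partition."""
        if old_path != new_path:
            self.pending[old_path] = now + self.delay_seconds
        # A path that is referenced again must not be deleted.
        self.pending.pop(new_path, None)

    def run_cleanup(self, now, delete):
        """Periodic job: delete every path whose grace period has passed."""
        expired = sorted(p for p, t in self.pending.items() if t <= now)
        for path in expired:
            delete(path)
            del self.pending[path]
        return expired
```

The periodic job only touches paths that were explicitly recorded as abandoned, so data that is still referenced by the metastore is never a candidate for deletion.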

Netflix has open sourced Mantis, its operations-focused event analysis system. Mantis aims to provide cost-effective, event-based analysis of a live system. It does so by allowing applications to publish lots of events but only incur the cost of those events when the event stream has a subscriber. At Netflix, they use Mantis for things like monitoring video streaming health, contextual alerting, measuring Cassandra health, and alerting on log events. The Mantis GitHub page has lots more details about the system, including an overview of its architecture and details on how to try it out.

https://medium.com/netflix-techblog/open-sourcing-mantis-a-platform-for-building-cost-effective-realtime-operations-focused-5b8ff387813a

https://netflix.github.io/mantis/
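
The "pay only when someone is listening" idea can be sketched in a few lines of Python. This is a generic illustration and not the Mantis API: the `EventStream` class and its methods are hypothetical, and the sketch assumes the publisher hands over a callable that builds the event rather than the event itself.

```python
class EventStream:
    """Hypothetical stream that skips event construction without subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, build_event):
        """Call build_event() and deliver the result only if anyone is subscribed.

        build_event is a zero-argument callable, so the (possibly expensive)
        work of assembling the event is never done when nobody is listening.
        """
        if not self._subscribers:
            return False
        event = build_event()
        for callback in self._subscribers:
            callback(event)
        return True
```

With this shape, application code can call `publish` freely on hot paths, and the cost of building and shipping events appears only once a subscriber attaches.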

A look at how the Devoted Health team uses Apache Airflow. They deploy on Kubernetes (using Terraform and Helm), which allows developers to get their own instance of the stack for development/testing. They've built a tool for standardized DAG development (using YAML definitions) and a dev tool for synchronizing code changes to Kubernetes. The post also describes how they write integration tests and validations, deploy, and monitor their deployment.

https://adam.boscarino.me/posts/airflow-at-devoted-health/

In another post on Airflow deployments, Lyft writes about how they've implemented fine-grained secure access to the Apache Airflow UI. While Airflow has built-in RBAC, they built a custom security manager that adds DAG-level access permissions (defined alongside the DAG). At Lyft, each team has its own RBAC role and can decide which teams have access to the DAGs they publish.

https://eng.lyft.com/securing-apache-airflow-ui-with-dag-level-access-a7bc649a2821
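
As a rough illustration of DAG-level access, the check below decides which DAGs a team role may see. It is plain Python and does not use Airflow's security manager or reflect Lyft's implementation: the `DAG_ACCESS` structure, the DAG ids, the role names, and both functions are hypothetical.

```python
# Hypothetical per-DAG access declarations, keyed by DAG id. In the setup the
# post describes, this information would live alongside each DAG definition.
DAG_ACCESS = {
    "rides_etl": {"owner_role": "data_platform", "allowed_roles": {"analytics"}},
    "billing_export": {"owner_role": "finance", "allowed_roles": set()},
}


def can_access(role, dag_id, dag_access=DAG_ACCESS):
    """Return True if the role owns the DAG or was granted access to it."""
    entry = dag_access.get(dag_id)
    if entry is None:
        return False  # deny by default for unknown DAGs
    return role == entry["owner_role"] or role in entry["allowed_roles"]


def visible_dags(role, dag_access=DAG_ACCESS):
    """List the DAG ids a role should see in the UI."""
    return sorted(d for d in dag_access if can_access(role, d, dag_access))
```

Denying by default means a DAG published without any access declaration stays hidden from every role until its owning team opts in.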

A post on the Confluent blog describes porting a non-trivial Kafka application from Clojure to Rust. The author describes the tradeoffs between the various Rust libraries for Kafka, explains how to extend a client to support Avro records and Schema Registry, and shares some benchmarks comparing the performance and memory usage of the Clojure and Rust versions.

https://www.confluent.io/blog/getting-started-with-rust-and-kafka

The Adobe Experience team writes about how they use bloom filters to speed up deletions of user data, for requests such as those made under GDPR. By maintaining a bloom filter for each partition of a data set, they can prune large swaths of data that definitely don't contain records for a particular user when deleting.

https://medium.com/adobetech/search-optimization-for-large-data-sets-for-gdpr-7c2f52d4ea1f
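
The per-partition pruning can be sketched with a small bloom filter in Python. This is a toy version under stated assumptions, not Adobe's implementation: the filter size, the number of hashes, the salted SHA-256 hashing scheme, and the function names are all illustrative choices.

```python
import hashlib


class BloomFilter:
    """Small bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 hash of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))


def build_filters(partitions):
    """Build one filter per partition from {partition_name: [user_id, ...]}."""
    filters = {}
    for name, user_ids in partitions.items():
        bloom = BloomFilter()
        for user_id in user_ids:
            bloom.add(user_id)
        filters[name] = bloom
    return filters


def partitions_to_scan(filters, user_id):
    """Return partitions that might hold the user; all others can be skipped."""
    return [name for name, bloom in filters.items() if bloom.might_contain(user_id)]
```

Because a bloom filter never reports a false negative, skipping a partition is always safe. A false positive only costs an unnecessary scan of a partition that turns out not to contain the user.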

`pg_flame` looks like a handy tool for visualizing the output of Postgres' `EXPLAIN ANALYZE`. Mousing over an entry in the flame graph reveals more details about that step of the query plan.

https://github.com/mgartner/pg_flame