Data Eng Weekly #330

A lot of breadth in this week's coverage of data engineering topics—from Delta to event processing at Spotify to the design goals of TileDB to understanding delivery guarantees in Apache Kafka. Also, a few articles on distributed systems—Yelp's autoscaling service, techniques for building reliable systems, and Facebook’s global routing infrastructure.

A look at how to use Delta's version tracking to "time travel" (read the contents of a table as it was at some previous point in time) and to produce audit details to track changes to a table. The post embeds several IPython notebooks with sample Scala code.
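The version-tracking idea can be illustrated with a toy sketch: every write appends an immutable snapshot plus audit metadata, so any historical version can be read back. This is a conceptual illustration only, not the Delta API (in Delta itself, time travel is exposed through options like `versionAsOf`, and the audit trail through `DESCRIBE HISTORY`); the class and field names here are made up.

```python
# Toy illustration of version-tracked "time travel": each write appends an
# immutable snapshot, so any historical version can be read back.

class VersionedTable:
    def __init__(self):
        self._versions = []  # list of (rows, commit_info) snapshots

    def write(self, rows, operation):
        # Every write produces a new immutable version plus audit metadata.
        self._versions.append((list(rows), {"version": len(self._versions),
                                            "operation": operation}))

    def read(self, version=None):
        # Default read returns the latest version; passing `version`
        # "time travels" to an earlier snapshot.
        if version is None:
            version = len(self._versions) - 1
        return self._versions[version][0]

    def history(self):
        # Audit log of every change, newest first.
        return [info for _, info in reversed(self._versions)]

table = VersionedTable()
table.write([{"id": 1, "qty": 10}], "INSERT")
table.write([{"id": 1, "qty": 7}], "UPDATE")

print(table.read())            # latest: qty == 7
print(table.read(version=0))   # as of version 0: qty == 10
print(table.history()[0]["operation"])  # most recent operation: UPDATE
```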

Yelp writes about their newly open-sourced autoscaler for Kubernetes, Clusterman. Compared to the Kubernetes Autoscaler, it has some interesting features, like support for spot instances and the ability to simulate production workloads.

With the complexity that comes with a distributed system, it can be overwhelming to start the process of identifying and classifying failure modes. This post provides a great high-level blueprint for how to approach the problem—including bucketing types of failure through Failure Modes and Effects Analysis and calculating a Risk Priority Number for each item in your analysis. Lots of great stuff (and spreadsheets to help!) for understanding how to build a reliable system.
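The Risk Priority Number calculation is simple enough to sketch: each failure mode gets 1–10 ratings for severity, occurrence, and detection, and the RPN is their product. The failure modes and ratings below are made-up examples, not from the post.

```python
# FMEA scoring sketch: Risk Priority Number = severity * occurrence * detection,
# with each factor rated on a 1-10 scale. Sort descending to triage.

def risk_priority_number(severity, occurrence, detection):
    return severity * occurrence * detection

failure_modes = [
    {"mode": "primary database loses quorum", "severity": 9, "occurrence": 2, "detection": 3},
    {"mode": "message queue consumer lags",   "severity": 5, "occurrence": 6, "detection": 2},
    {"mode": "silent data corruption",        "severity": 8, "occurrence": 2, "detection": 9},
]

for fm in failure_modes:
    fm["rpn"] = risk_priority_number(fm["severity"], fm["occurrence"], fm["detection"])

# Work the highest-risk items first.
for fm in sorted(failure_modes, key=lambda fm: fm["rpn"], reverse=True):
    print(f'{fm["rpn"]:>4}  {fm["mode"]}')
```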

Spotify shares lessons learned from operating a large-scale event delivery system, which powers some key business functions like tracking track plays to calculate royalties. They describe some of the principles (e.g. segregating data by event type and choosing liveness over lateness), how they think about Service Level Objectives for event delivery, and how they are exposing their event platform to other internal teams.

Coupang, an e-commerce company from South Korea, writes about the evolution of their data platform over the past several years. They've been through several phases: storing all data in a relational database, using Hadoop, Hive, & an MPP database, and migrating to the cloud. The post dives into some of the other important features of their pipeline, like how they track data quality, detect data abnormalities, and make data discoverable to users.

TileDB is a new storage engine for multi-dimensional data like that commonly used for machine learning and genetics analysis. It supports both sparse and dense arrays, and it can use blob storage (such as S3) as the storage backend. The post describes the motivation for a new storage system, and how they've optimized the implementation for efficiency and cross language support.

Criteo writes about how they enforce data quality for the 450 PB that they have in their Hadoop clusters. Since Hadoop is very flexible/lenient in the data it supports, they run over 7,000 data quality checks per day using statistics from Hive tables and custom queries.
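One common flavor of such a check can be sketched in a few lines: compare the newest partition's row count against the recent average and flag large swings. The threshold and counts here are illustrative, not Criteo's.

```python
# Data-quality check sketch: flag a partition whose row count deviates from
# the mean of recent partitions by more than a tolerance fraction.

def row_count_check(history, latest, tolerance=0.5):
    """Return True if `latest` is within `tolerance` (as a fraction) of the
    mean of `history`; an empty history passes by default."""
    if not history:
        return True
    mean = sum(history) / len(history)
    return abs(latest - mean) <= tolerance * mean

assert row_count_check([100, 110, 90], 95)      # within 50% of the mean
assert not row_count_check([100, 110, 90], 10)  # suspicious drop -> alert
```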

The morning paper covers Taiji, Facebook's system for routing and managing global traffic. Taiji takes advantage of knowledge about a user and their social connections to efficiently route traffic from the edge to a data center. By better taking advantage of caches, changes in routing materially reduce load on the database systems.

An in-depth look at how the various Apache Kafka producer and consumer configurations can be used to avoid duplicate data in your pipeline. This post describes the various types of delivery guarantees, how they map to Kafka settings, and how to put them all together. The post has lots of great diagrams, and there's a bonus section on other strategies for deduplicating data in a stream.
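The settings usually combined for stronger delivery guarantees can be collected into config dicts (key names follow Kafka's/librdkafka's conventions, as used by clients like confluent-kafka). The broker address and transactional id below are placeholders; whether you need all of these depends on your pipeline.

```python
# Producer side: idempotence plus acks=all prevents duplicates from retries;
# a transactional.id enables transactional (exactly-once) writes.
producer_config = {
    "bootstrap.servers": "broker:9092",       # placeholder address
    "acks": "all",                  # wait for all in-sync replicas
    "enable.idempotence": True,     # broker de-duplicates producer retries
    "transactional.id": "my-app-1", # placeholder; required for transactions
}

# Consumer side: commit offsets manually after processing, and skip records
# from aborted transactions.
consumer_config = {
    "bootstrap.servers": "broker:9092",
    "group.id": "my-app",
    "enable.auto.commit": False,          # commit only after processing
    "isolation.level": "read_committed",  # hide aborted transactional records
}
```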


Curated by Datadog


Apache Kafka and Immutable Infrastructure + An Introductory Kafka Talk! (Santa Monica) - Tuesday, November 19

Learning How to Build Event-Streaming Applications with Pac-Man (San Diego) - Tuesday, November 19

Bay Area Apache Flink Meetup @ Cloudera (Palo Alto) - Wednesday, November 20

Building Data Lineage, Data Orchestration, and Data Mesh (Mountain View) - Thursday, November 21


The Rise of Apache Flink and Stream Processing (Bellevue) - Wednesday, November 20


Foreign-Key Joins with Kafka Streams (Plano) - Thursday, November 21


Introduction to Apache Spark (Madison) - Monday, November 18


Data Science and Engineering Club @ Zalando (Dublin) - Thursday, November 21


Fine-Tuning Kafka: Let's Look under the Hood! (Paris) - Wednesday, November 20


Apache Beam Intro and Use Cases (Antwerpen) - Friday, November 22


Kafka Is More ACID Than Your Database (Kongens Lyngby) - Tuesday, November 19


Apache Beam @ + Portable Schema + BeamSQL (Zurich) - Tuesday, November 19


Apache Kafka as a Database (Brno-stred) - Wednesday, November 20


Event-Driven Microservices with CQRS Using Axon: Excitingly Boring (Singapore) - Wednesday, November 20


Melbourne Data Engineering Meetup (Docklands) - Wednesday, November 20

Data Eng Weekly #329

Some great stuff in this week’s issue—Lyft’s open source metadata and data discovery platform, how Netflix uses GraphQL to build search indexes, an open source backup tool for Apache Cassandra, coverage of the Ceph distributed file system's evolution, and several other posts about Apache Airflow, Apache Spark, CockroachDB, event-driven microservices, and more!

Lyft writes about Amundsen, their open-source tool for metadata management and data discovery. The post covers the architecture of the system—the metadata, search, and frontend services as well as the databuilder (data ingestion framework). Since open sourcing earlier this year, they've had a number of contributions, such as one to make the datastore more extensible (supporting Apache Atlas in addition to Neo4j).

A look at how to use Docker Swarm to scale out Apache Airflow using a custom Airflow operator.

One data engineer's learnings after a year on the job. There are some good reflections on tools, automation (once a workflow reaches a certain size, a tool like Airflow is important), monitoring, and metadata/documentation housekeeping.

Netflix describes how they use GraphQL to build indexes from data stored across multiple services. The key idea is to issue a (batch) GraphQL query to return a full denormalized record (e.g. a show and its episodes) and store the results in Elasticsearch. Next, they can listen to changes on Kafka and follow the relationships from the GraphQL schema to invalidate/reindex data.
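A toy sketch of the approach: a (pretend) GraphQL response already contains the show and its nested episodes, so the whole response can be stored as one denormalized search document, and a change event for any nested entity triggers a re-fetch and re-index of the parent document. All names and shapes below are illustrative, not Netflix's schema.

```python
index = {}  # stands in for Elasticsearch

def fetch_show(show_id):
    # Stand-in for issuing the GraphQL query; a real system would get the
    # full denormalized record from the GraphQL gateway.
    return {"id": show_id, "title": "Example Show",
            "episodes": [{"id": "e1", "title": "Pilot"}]}

def reindex(show_id):
    # Store the whole denormalized record as one search document.
    index[show_id] = fetch_show(show_id)

# Initial index build.
reindex("s1")

# A change event arrives (e.g. via Kafka) for episode "e1"; relationships
# derived from the GraphQL schema tell us which show documents reference it.
episode_to_show = {"e1": "s1"}
changed_episode = "e1"
reindex(episode_to_show[changed_episode])

print(index["s1"]["episodes"][0]["title"])  # Pilot
```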

Medusa is a new open source backup tool for Apache Cassandra. It stores backup data in cloud storage (e.g. Amazon S3 or Google Cloud Storage), and it creates smart incremental backups by taking advantage of the immutable nature of SSTables. There are many more details about the tool's features and how to use it in this post on The Last Pickle blog.

An in-depth look at the Apache Kafka consumer rebalance protocol. The post describes the pieces of the protocol like JoinGroup, SyncGroup, Heartbeat, and LeaveGroup. It also looks at the recent additions of static membership and incremental cooperative rebalancing. There are lots of great diagrams to illustrate the key concepts.

A look at how to detect skew in your Apache Spark jobs, and several ways to fix a job with skew (hints, randomizing the join key, writing a custom partitioner). Which solution is best/fastest depends a bit on the inputs to your job.
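One of the fixes mentioned, randomizing (salting) the join key, can be illustrated in plain Python (this is the idea, not Spark code): the big side of the join appends a random salt to the hot key so its rows spread across several partitions, while the small side replicates each row once per salt so every variant still finds its match.

```python
import random

NUM_SALTS = 4  # how many ways to split the hot key (tuning choice)

def salt_key(key):
    # Big side of the join: each row gets a random salt suffix.
    return f"{key}#{random.randrange(NUM_SALTS)}"

def explode_key(key):
    # Small side of the join: replicate once per salt so every salted
    # variant of the key still joins successfully.
    return [f"{key}#{s}" for s in range(NUM_SALTS)]

random.seed(0)
hot_rows = [salt_key("hot_key") for _ in range(1000)]

# The 1000 rows for the single hot key now carry NUM_SALTS distinct keys,
# so they hash to several partitions instead of one.
print(sorted(set(hot_rows)))
print(explode_key("hot_key"))
```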

The morning paper covers a paper on Ceph, the open-source distributed file system. Over the past few years, Ceph implemented a new store that bypasses the filesystem to better take advantage of SSD and HDD disks. The post describes the motivation for the changes, some of the other options they explored (including RocksDB), and the performance improvements they see with the new storage backend.

Cockroach Labs writes about how they've sped up distributed transactions with parallel commits, which avoids certain round trips across the WAN. The post describes the solution, including how failure handling works, and it shows that experimentally latency is cut in half.

This post makes the case for choosing a relational database and taking advantage of its advanced features (such as triggers and stored procedures). The author draws on experience working with Jupyter notebooks and compares the complexity of NoSQL databases like MongoDB and Elasticsearch.

This article describes an event-driven architecture for maintaining a CRM and realtime database. An interesting component of the post details how to implement an audit system to ensure that all microservices consume the events. The basic idea is to tag events with a unique ID, and each microservice generates an audit event with that ID as it processes the event. Alerts are generated when (after aggregating based on a time window) the number of audit events for a particular ID is too low.
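The windowed audit check can be sketched as a simple aggregation: count audit events per event ID within one window and alert on any ID with fewer events than the number of consuming services. The service names and expected count below are illustrative.

```python
from collections import Counter

EXPECTED_CONSUMERS = 3  # how many services must process each event

# (event_id, service) audit events collected during one time window.
audit_events = [
    ("evt-1", "billing"), ("evt-1", "crm"), ("evt-1", "analytics"),
    ("evt-2", "billing"), ("evt-2", "crm"),   # analytics missed evt-2
]

# Aggregate per event ID, then alert on any ID that was under-consumed.
counts = Counter(event_id for event_id, _ in audit_events)
alerts = [event_id for event_id, n in counts.items() if n < EXPECTED_CONSUMERS]

print(alerts)  # ['evt-2']
```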


Curated by Datadog


Data-Driven Development in Autonomous Driving + Spark Performance Tuning (Mountain View) - Tuesday, November 12

Data Engineering Meetup (San Diego) - Thursday, November 14


Mirror Maker 2.0 (Austin) - Tuesday, November 12


NOVA Data Engineering: First Meetup! (Herndon) - Thursday, November 14


Data Meetup (Sao Carlos) - Wednesday, November 13


Streaming Processing with Hazelcast Jet and Kafka (Stockholm) - Tuesday, November 12


Everything You Need to Know about Kafka Streams (A Coruna) - Thursday, November 14


Airflow @ SchoolMouv: Build, Schedule, and Monitor Pipelines at Scale (Toulouse) - Wednesday, November 13


Making Apache Spark Better with Delta Lake (Prague) - Thursday, November 14


QA in Beam + Beam Use Case + More! (Warsaw) - Thursday, November 14


Timeseries Forecasting as a Service + Run Spark and Flink Jobs on Kubernetes (Athens) - Thursday, November 14


Airflow Demystified + Big Data Demystified (Tel Aviv-Yafo) - Sunday, November 17


Kafka Beijing Meetup (Beijing) - Saturday, November 16


Kafka Is More ACID Than Your Database (Sydney) - Wednesday, November 13

K8s Meetup with Instaclustr & Google! (Pyrmont) - Wednesday, November 13

Sydney Data Engineering Meetup (Sydney) - Thursday, November 14

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly #328

Short and sweet issue this week, with several new open source tools—Beekeeper for cleaning up unused data, the Mantis project for real-time operations, and pg_flame's flame graphs for analyzing postgres queries—as well as implementation articles covering Apache Airflow, Rust for Kafka, and using bloom filters to optimize GDPR data deletion.

Beekeeper is a new open-source tool from Expedia for cleaning up unreferenced data in Apache Hive tables. It listens for metastore changes (like changes to table or partition locations) and periodically deletes abandoned data in S3. The github repo has instructions for trying it out using docker.
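The core idea can be sketched in plain Python: anything present in storage but no longer referenced by the metastore, and older than a grace period, is a candidate for deletion. The paths, timestamps, and grace period below are made up for illustration; this is not Beekeeper's implementation.

```python
import datetime

GRACE = datetime.timedelta(days=7)           # don't delete very recent data
now = datetime.datetime(2019, 11, 1)

# Locations the metastore still references.
referenced = {"s3://bucket/table/part=1", "s3://bucket/table/part=2"}

# Everything actually present in storage, with last-modified times.
in_storage = {
    "s3://bucket/table/part=1": datetime.datetime(2019, 10, 1),
    "s3://bucket/table/part=2": datetime.datetime(2019, 10, 20),
    "s3://bucket/table/part=0_old": datetime.datetime(2019, 9, 1),   # orphaned
    "s3://bucket/table/part=3_tmp": datetime.datetime(2019, 10, 30), # too new
}

# Unreferenced AND past the grace period -> safe to clean up.
to_delete = [path for path, modified in in_storage.items()
             if path not in referenced and now - modified > GRACE]

print(to_delete)  # ['s3://bucket/table/part=0_old']
```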

Netflix has open sourced Mantis, its operations-focused event analysis system. Mantis aims to provide cost-effective, event-based analysis of a live system. It does so by allowing applications to publish lots of events but only incur the cost of those events when there's a subscriber to the event stream. At Netflix, they use Mantis for things like monitoring video streaming health, contextual alerting, measuring Cassandra health, and alerting on log events. The Mantis github page has lots more details about the system, including an overview of its architecture and details on how to try it out.

A look at how the Devoted Health team uses Apache Airflow. They deploy on Kubernetes (using Terraform and Helm), which allows developers to get their own instance of the stack for development/testing. They've built a tool for standardized DAG development (using YAML definitions) and a dev tool for synchronizing code changes to Kubernetes. The post also describes how they write integration tests and validations, deploy, and monitor their deployment.

In another post on Airflow deployments, Lyft writes about how they've implemented fine-grained secure access to the Apache Airflow UI. While Airflow has built-in RBAC, they built a custom security manager that adds DAG-level access permissions (defined alongside the DAG). At Lyft, each team has its own RBAC role and can decide which teams have access to the DAGs they publish.

A post on the Confluent blog describes porting a non-trivial Kafka application from Clojure to Rust. The author describes the tradeoffs between the various Rust libraries for Kafka, how to extend a client to support Avro records/Schema Registry, and shares some benchmarks comparing Clojure and Rust performance/memory usage.

The Adobe Experience team writes about how they use bloom filters to speed up deletions of user data, for requests like those in GDPR. By maintaining a bloom filter for each partition of a data set, they can prune large swaths of data that definitely don't contain records for a particular user when deleting.
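The pruning trick rests on a property of bloom filters: they can say "definitely not present" (no false negatives), so any partition whose filter rejects a user ID can be skipped entirely during a deletion scan. Here's a minimal bloom filter to make that concrete; the sizes and partition layout are toy values, not Adobe's implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored as one big integer

    def _positions(self, item):
        # Derive num_hashes positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means "maybe" (false
        # positives are possible, false negatives are not).
        return all(self.bits & (1 << pos) for pos in self._positions(item))

# One filter per partition; a deletion only scans partitions that might
# hold the user's records.
partitions = {"p0": ["user-1", "user-2"], "p1": ["user-3"]}
filters = {}
for name, users in partitions.items():
    bf = BloomFilter()
    for user in users:
        bf.add(user)
    filters[name] = bf

to_scan = [name for name, bf in filters.items() if bf.might_contain("user-3")]
print(to_scan)  # 'p1' is always included; 'p0' only on a rare false positive
```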

`pg_flame` looks like a handy tool for visualizing the output of postgres' `EXPLAIN ANALYZE`. Mousing over an entry in the flame graph reveals more details about that step of the `EXPLAIN`.


Curated by Datadog


Deploying Kafka SSL at Uber Scale + What’s the Time & Why? (Palo Alto) - Tuesday, October 29


Planning for the Varied Levels of Once-Data-Processing with Kafka (Scottsdale) - Tuesday, October 29


Everything You Want to Know about Apache Kafka, but Were Afraid to Ask (Sao Paulo) - Wednesday, October 30


Building Data Products in Production (London) - Tuesday, October 29


2 Years in Prod with a Hadoop Cluster (Villeurbanne) - Tuesday, October 29


Kafka at Sky Deutschland + Kafka on Kubernetes  (Unterfohring) - Tuesday, October 29


Kafka Connect (Torun) - Wednesday, October 30


Cloud Engineering at GoPro (Bucharest) - Tuesday, October 29


KSQL (Saint Petersburg) - Monday, October 28


Stream Processing with Apache Kafka, Flink, and Spark (Bengaluru) - Saturday, November 2


ETL, Kafka, and Elasticsearch Combined! (Sydney) - Tuesday, October 29

Data Eng Weekly #327

This week's issue covers quite the range of topics—Netflix's change data capture architecture, optimizing cloud costs at Segment, the Apache Arrow Flight protocol, Kubernetes operators/controllers, and Python concurrency. Also, a look at two new projects—MLflow Model Registry and DuckDB, an embedded columnar database engine.

The Apache Arrow blog writes about the new Arrow Flight protocol for sending data fast and efficiently (by sending data in Arrow format). The post goes into the motivation of Flight, describes some of the basics of a Flight server, describes how Flight builds on gRPC, and more. While it's still fairly early in the development process, Flight could prove to be really important for improving the efficiency of large scale data processing.

A look at the `pg_prewarm` extension for prewarming the PostgreSQL cache, including how to enable it to run automatically on server startup.

Netflix writes about Delta, their system for shuffling data between systems using change data capture (CDC). They've built delta connectors for MySQL and Postgres that stream data to Apache Kafka. The post discusses their Kafka configurations and the stream processing framework (built on Apache Flink) that processes the CDC data and enriches it to build denormalized records.

The MLflow Model Registry is a new extension to the MLflow project that provides an API and Web UI for uploading and promoting machine learning models across environments. It has first-class notions of environments/lifecycle stages (e.g. to promote from staging to production), which makes it a good match for CI/CD tooling.

To speed up your Python scripts, you can use multithreading or multiprocessing. This post shows how, if you write your code in a functional way, you can introduce parallelism with only a few changes. It demonstrates ThreadPoolExecutor and ProcessPoolExecutor and the tradeoffs between the two.
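The pattern is easy to sketch: when the work is already expressed as a pure function mapped over inputs, switching to a pool is a one-line change. `ProcessPoolExecutor` has the same interface and suits CPU-bound work (it sidesteps the GIL), while threads suit I/O-bound work; the `transform` function here is a trivial stand-in.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(n):
    # Stand-in for an I/O-bound task (e.g. an HTTP fetch per item).
    return n * n

items = range(8)

# Serial version:
serial = list(map(transform, items))

# Parallel version: same shape, just mapped through an executor.
# Swap in ProcessPoolExecutor for CPU-bound work.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(transform, items))

print(serial == parallel)  # True: same results, potentially less wall time
```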

In Kubernetes, operators and controllers are pretty common for stateful systems or those otherwise dealing with data. Even if you're not building a Kubernetes controller yourself, this post that describes the differences between the two is a good introduction.

Segment describes several optimizations that they made to improve their infrastructure costs. The changes are across all parts of the stack—from data systems to the javascript file that they're serving to customers. On the data front—they describe changes they made to their deployments of Apache Kafka (switching to instances with local storage) and NSQ (moving from a colocated model to a centralized cluster). They also made changes to minimize cross-AZ transfer costs—alterations to Kafka clients and service discovery to keep traffic inside of a single zone.

DuckDB is a new embedded, columnar database optimized for analytics workloads. This post shows how to use it via Python bindings, and it compares performance with SQLite on a few queries.


Curated by Datadog


Kafka on Kubernetes: Just Because You Can, Doesn't Mean You Should! (New York) - Tuesday, October 22


Free Apache Kafka Workshop (Boston) - Tuesday, October 22


10th Data Engineering Meetup (Belo Horizonte) - Wednesday, October 23


Kuberoo (London) - Thursday, October 24


Trustly Duchess Meetup: Introduction to Apache Kafka and Reactive Java (Stockholm) - Wednesday, October 23


Design Principles for an Event-Driven Architecture/Streaming with KSQL (Las Rozas de Madrid) - Thursday, October 24

Extending Spark for Qbeast's SQL DataSource (Barcelona) - Thursday, October 24


Data Engineering with Delta Lake, Pulsar, and Spark-Tools (Paris) - Tuesday, October 22


Full Day Apache Cassandra & Kafka Workshop (Berlin) - Monday, October 21

FREE NOW Data Journey to Kafka (Hamburg) - Tuesday, October 22

Cassandra Meets Kafka at ApacheCon! (Berlin) - Wednesday, October 23

Apache Kylin Meetup @ OLX (Berlin) - Thursday, October 24


Rg-Dev #32 (Rzeszow) - Thursday, October 24


Sydney Data Engineering Meetup (Surry Hills) - Thursday, October 24

Fintech Production with Kafka Streams (Melbourne) - Thursday, October 24


Data Eng Weekly #326

Back after another week off—so we've got the best articles from the past two weeks. Several interesting new things to check out this week—Bigslice and Bigmachine from GRAIL, an interesting strategy for turning change data capture events into audit events on the Debezium blog, and the SLOG system that aims to provide low-latency and strict serializability for multi-region systems. Lots more good stuff—posts on data pipelines, a look at new features in PostgreSQL 12, and auto scaling for Apache Airflow.

GameChanger writes about how they've (mostly) automated loading of data from their data pipeline to the data warehouse. Some friction comes from defining the schema in the data warehouse. A new tool was written to create definitions based on the Avro Schema from the Confluent Schema Registry.
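The idea behind such a tool can be sketched as a schema walk: map each Avro field type to a warehouse column type and emit a CREATE TABLE statement. The type mapping and example schema below are illustrative; a real tool would fetch the schema from the Confluent Schema Registry and handle unions, logical types, nested records, and so on.

```python
# Minimal Avro-record-to-DDL sketch (flat records, primitive types only).
AVRO_TO_SQL = {"string": "VARCHAR", "int": "INTEGER", "long": "BIGINT",
               "double": "DOUBLE PRECISION", "boolean": "BOOLEAN"}

def avro_to_ddl(schema):
    columns = ", ".join(
        f'{field["name"]} {AVRO_TO_SQL[field["type"]]}'
        for field in schema["fields"])
    return f'CREATE TABLE {schema["name"]} ({columns})'

schema = {
    "type": "record", "name": "game_event",
    "fields": [{"name": "game_id", "type": "string"},
               {"name": "score", "type": "int"}],
}

print(avro_to_ddl(schema))
# CREATE TABLE game_event (game_id VARCHAR, score INTEGER)
```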

The kafkacat CLI tool can be used for quick-to-set-up (but not production-ready) replication between Kafka clusters/topics. This post describes how to invoke it, and what some of the caveats are.

The Debezium blog shares the details of implementing a fascinating technique for building an audit log using change data capture data. The general idea is to populate a secondary table keyed on transaction id with the details of the JWT that was used to perform the transaction. Then, Apache Kafka Streams is used to join the data between CDC streams for those tables. The post dives into how to build out this type of system in full detail (e.g. lots of sample code showing how to build 1) a JAX-RS Interceptor to automatically populate the table based on the JWT and 2) the Kafka Streams application).
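The join at the heart of the technique can be sketched in plain Python (the post's actual implementation is a Kafka Streams application): CDC records carry the transaction id that produced them, a second CDC stream carries per-transaction metadata captured from the JWT, and joining the two on transaction id yields an enriched audit record. Field names below are illustrative.

```python
# Per-transaction metadata, as captured from the JWT into a secondary table
# and surfaced via its CDC stream.
tx_metadata = {
    "tx-42": {"user": "alice", "client": "mobile-app"},
}

# CDC records from the business table, each tagged with its transaction id.
cdc_records = [
    {"table": "orders", "op": "UPDATE", "tx_id": "tx-42",
     "after": {"order_id": 7, "status": "shipped"}},
]

# Join on transaction id to produce enriched audit records.
audit_log = [{**record, "audit": tx_metadata[record["tx_id"]]}
             for record in cdc_records]

print(audit_log[0]["audit"]["user"])  # alice
```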

PostgreSQL 12 was released a little over a week ago. The announcement describes some of the features (lots of performance improvements), and a second post describes a new feature, generated columns. There are some interesting use cases for these, such as normalizing text data for searches.

GRAIL has open sourced Bigslice and Bigmachine, which enable distributed computation across large datasets using simple Golang programs. Unlike other big data tools, Bigslice spins up EC2 instances at runtime to distribute your computation. It exposes a high-level programming model (e.g. Map, Join, Filter) for batch processing. The introductory blog post and the github project have many more details, including how to get started (it looks quite easy!).

I can't tell you how many times I've seen a syntax error because I tried to reference a table/column in a particular part of a SQL query. This post conveys when it's OK to cross-reference columns/tables defined in other components of a SQL query. There's a good cheat sheet that you might find useful as a reference.

For those interested in distributed systems at global scale—this post dives into SLOG, a new system designed to offer low latency and strict serializability by taking advantage of locality in client access patterns. The post gives a good introduction to the high-level intuition and system design, and if you want more, the full VLDB paper is linked.

Facebook's Scribe is a high-throughput (2.5 TB per second at peak) system for capturing log data. This post shares the high-level design of the system—covering topics like availability (e.g. buffering data to local disk in case of network issues), scalability, and multitenancy.

LinkedIn has open sourced the version of Apache Kafka that they run in production across thousands of brokers—it's based on Apache Kafka release branches with their own changes added. The post talks about some of the improvements they've made, like better scalability by reusing UpdateMetadataRequest objects and a maintenance mode that makes it easier to cleanly take down a broker. The post also describes their development process, and how they integrate with the Apache Kafka upstream project.

A look at how to ensure you're getting the best performance out of postgres (things like partial indexes and increasing the shared buffer cache) as well as some advanced features you might not have known about like text search, geospatial indexes, hstore for key/value data, and JSON/XML data types.

A look at using the Kubernetes HorizontalPodAutoscaler to autoscale the workers of an Apache Airflow deployment. While the post has some details that are specific to Google Cloud Composer (a managed service for Apache Airflow), if you're interested in autoscaling your Airflow workers, this looks like a good place to get started.
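The shape of a worker autoscaling policy is a standard HorizontalPodAutoscaler manifest; the sketch below scales a worker Deployment on CPU utilization. The resource names and thresholds are illustrative (not from the post), and a CPU-based policy is only a starting point—queue depth is often a better signal for Airflow workers.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-worker        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-worker      # the worker Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```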

Convoy writes about how they improved the latency of data loads to their data warehouse, using Kafka Connect and Debezium to load JSON data from Postgres into Snowflake. The post has lots of practical details on deploying a production pipeline of this style.


Curated by Datadog


Hadoop Rising: The Evolving Ecosystem (Boulder) - Thursday, October 17


Full-Day Apache Cassandra and Kafka Workshop (Chicago) - Thursday, October 17


Kafka & KSQL (Columbus) - Tuesday, October 15


Apache Beam Meetup 8: Streaming SQL in Beam + Beam Use Case by Huq Industries (London) - Wednesday, October 16


Apache Beam Meetup 2: Portability, Beam on Spark, and More! (Paris) - Thursday, October 17


Berlin AWS Group Meetup (Berlin) - Tuesday, October 15

Apache Kafka at Deutsche Bahn & Confluent Cloud (Frankfurt) - Wednesday, October 16


Managing Data Flows: Apache NiFi Deep Dive + Streaming Use Cases (Vienna) - Thursday, October 17


First Warsaw Airflow Meetup (Warsaw) - Thursday, October 17


MQTT and Apache Kafka: A Case Study of Uchumi Commercial Bank-Tanzania (Nairobi) - Saturday, October 19


Open Source Technologies at Expedia (Bangalore) - Wednesday, October 16


Apache Kafka and Microservices (Singapore) - Thursday, October 17


Viktor Gamov and George Hall Talk Kafka, Kubernetes, Connectors, and Operator (Docklands) - Tuesday, October 15

Kafka on Kubernetes: Does It Really Have to Be “The Hard Way”? (Sydney) - Thursday, October 17

FinTech Production with Kafka Streams (Melbourne) - Thursday, October 17

