Data Eng Weekly #332

This week's issue includes articles on Pinterest's time series database, deploying Elasticsearch on Kubernetes, data & ML engineering at Slack, Airbnb's job queuing system, and the architecture of the Cliqz search engine. There's also a slew of posts on other topics, from optimizing Kafka consumers to JSON processing in Golang.

A look at the inner workings of Kafka consumers, with some real-world recommendations for deploying them when there's high latency in talking to the Kafka cluster and/or a large number of partitions. There are tips on important metrics to monitor, configuration, garbage collector settings, and changing the partition assignment strategy to improve unbalanced consumers.

This article looks at why hash maps (unsorted) are popular for in-memory indexes whereas b-trees (sorted) are common in databases. It describes the trade-offs of the two approaches and how each best fits in-memory and database use cases.
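To make the trade-off concrete, here's a small Python sketch with invented data: a hash map wins for point lookups, while a sorted structure (the role a B-tree plays on disk) can answer range queries without scanning every entry.

```python
import bisect

# Point lookup: a hash map answers "value for key k?" in O(1) on average.
index = {"user:42": "row-7", "user:7": "row-3", "user:99": "row-1"}
assert index["user:42"] == "row-7"

# Range scan: a sorted structure answers "all keys in [lo, hi)" via binary
# search plus a contiguous slice, which is what a B-tree provides on disk.
keys = sorted(index)
lo = bisect.bisect_left(keys, "user:5")
hi = bisect.bisect_right(keys, "user:9")
print(keys[lo:hi])  # → ['user:7'] (lexicographic order)
```

The hash map can't answer the range query without a full scan, which is why databases pay the extra cost of keeping keys sorted.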

This post describes a large scale conversion of data into JSON format in order to load it into BigQuery. To meet the naming requirements of BigQuery, they had to remap field names on every JSON document. Their tool, which is written in Golang, uses a producer/consumer job queue to parallelize processing and partition the data before writing it out. They processed data both from Kafka and S3, and the post talks a bit about how they optimized interaction with S3.
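As a rough illustration of the renaming step (not the authors' actual Go code): BigQuery column names may contain only letters, digits, and underscores and must start with a letter or underscore, so every key in every nested document has to be rewritten. A Python sketch:

```python
import re

def remap_field(name: str) -> str:
    # BigQuery column names: letters, digits, underscores only,
    # and they must start with a letter or underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned

def remap_document(doc):
    # Recursively rewrite keys on nested JSON objects and arrays.
    if isinstance(doc, dict):
        return {remap_field(k): remap_document(v) for k, v in doc.items()}
    if isinstance(doc, list):
        return [remap_document(v) for v in doc]
    return doc

print(remap_document({"user-id": 1, "meta": {"2fa.enabled": True}}))
# → {'user_id': 1, 'meta': {'_2fa_enabled': True}}
```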

Pinterest writes about how they've extended their time series data store, Goku, to support querying of historical data. They tier data by compacting data through rebucketing and downsampling. For serving, they load data from S3 into RocksDB. The post goes into the details of the design of their RocksDB setup, cluster management functions, and the query processing framework.

An overview of Kubernetes and how to deploy Elasticsearch on it. This is a great introduction to many of the core concepts of Kubernetes (e.g. Deployment, Pod), including those that are important for running a stateful service (e.g. StatefulSet, PersistentVolumeClaim). It also shows how to configure accounts for ES using Kubernetes RBAC and how to use the Helm package manager for deploying to Kubernetes.

Available both in audio form and as a transcript, InfoQ has a podcast with Josh Wills that covers the evolution of data engineering and machine learning at Slack. The interview covers their data pipeline, which feeds into big data systems for BI/data warehousing and ML products. The interview also covers the kinds of products they build with machine learning and some thoughts on the future of observability for ML pipelines.

Airbnb has open sourced Dynein, their job queuing system that they use for offloading tasks from the main request path and performing other asynchronous operations. It uses DynamoDB as a scheduler for future jobs and SQS for queuing—the post describes how this is built in a highly scalable way.

When you're running a high throughput system in Java, issues with garbage collection are inevitable. This post provides details on how to enable GC logs, how to interpret the details from the Concurrent Mark Sweep and the G1GC collectors, and some tools for visualizing the output. This is one of the most comprehensive guides that I've seen on the topic.

Cliqz, makers of a web search engine, have a comprehensive post on their architecture. Their near real-time indexing system is built with Apache Kafka, Apache Cassandra, and RocksDB while their batch indexing system is built on MapReduce and Spark with Luigi for managing workflows. The post also describes how they manage Kubernetes clusters, use Helm/Helmfile for package management, and leverage Tilt and K9s for local development. They also share on how they optimize costs and describe their machine learning pipelines.


Curated by Datadog


Building a Best-in-Class Data Lake on AWS and Azure (Santa Clara) - Tuesday, December 17

North Carolina

Zero to Observability with Apache Kafka (Raleigh) - Tuesday, December 17


Introduction to Kafka (Montreal) - Wednesday, December 18


Data Engineering Meetup (Belo Horizonte) - Wednesday, December 18


Event Deduplication in Kafka + Navigation in a 3D Environment with RL (Novi Sad) - Wednesday, December 18


Spark Advanced Topics (Tel Aviv-Yafo) - Thursday, December 19


Real-Time Data: Streaming and Collecting Data in Real Time (Moscow) - Wednesday, December 18


Designing ETL Pipelines with Structured Streaming and Delta Lake (Bengaluru) - Wednesday, December 18

Apache Kafka and Stream Processing Meetup @ Walmart (Bengaluru) - Sunday, December 22


Insights into HDInsights (Colombo) - Wednesday, December 18


Flink Meetup (Seoul) - Tuesday, December 17

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly #331

After a few weeks off (hopefully folks in the US had a nice Thanksgiving!), we're back with your weekly fix of data engineering articles. Apache Kafka and Apache Airflow are covered from several angles in this issue, and there are posts on the future of data engineering, columnar file formats, bloom filters, and Cruise's platform for data pipelines. Lots of great posts from folks building large-scale data platforms!

This article provides an overview of a talk at the recent QCon San Francisco on the future of data engineering. The talk covers six stages of data engineering and what it takes to evolve from one stage to the next, through the lens of the data architecture at WePay. The talk also covers what's ahead in the field. If you want to dive in more, the article links out to the slides for the presentation.

Zeebe is a workflow engine that can be used to execute and/or monitor workflows that span multiple microservices. This post looks at how it can integrate with Apache Kafka—as a source of data for monitoring or as a sink for publishing information about the state of the workflow. There are several good diagrams in the post to illustrate the key concepts.

Zulily writes about how they have evolved their Apache Airflow architecture—moving from celery executors to the Kubernetes executor, leveraging AWS RDS for metadata, and using AWS EFS for the DAGs. The post also describes their CI/CD workflow, and more.

A good introduction to the Apache Kafka Client Consumer's PartitionAssignor strategies. The post covers the three built-in strategies (Range, RoundRobin, StickyAssignor), the StreamsPartitionAssignor from Kafka Streams, and how to implement a custom strategy. As an example, the post walks through building a FailoverAssignor that could be used for an active/passive setup.

This article provides a good introduction to columnar file formats—describing how they physically store data (with an example of translating a CSV to a columnar CSV format), the benefits of columnar formats, and some of the trade-offs.
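The core idea can be shown in a few lines of Python with invented records: the same data, pivoted from a row layout to a column layout, lets a query touch only the columns it needs (and similar values stored together compress better).

```python
rows = [
    {"id": 1, "country": "US", "amount": 9.99},
    {"id": 2, "country": "US", "amount": 4.50},
    {"id": 3, "country": "DE", "amount": 7.25},
]

# Row layout stores each record contiguously; a columnar layout stores each
# field contiguously instead.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query like SELECT SUM(amount) scans only one column, not every record.
print(columns["amount"])
print(sum(columns["amount"]))
```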

Pinterest writes about the service they've built to support large amounts of offline updates to their sharded MySQL cluster. This service, which exposes APIs for batch write operations, groups writes/updates based on operation type and shard. It also uses Kafka as a buffer—consumers fetch batch operation details and write to MySQL at a configured rate limit to keep the load of offline operations from impacting user-interactive queries. The post dives into technical details, including how they handle hot shards, variation in write operations, and the improvements they've seen from this new system.

A look at several techniques for monitoring Apache Kafka and related components. The post describes Quantyca's approach using MetricBeat with Burrow and Elastalert. They have an example of sending an alert to Slack.

Azkarra Streams is a new framework for building Apache Kafka Streams applications. It provides a library that eliminates a lot of the boilerplate of a typical streams application, along with a built-in HTTP server to monitor the state of your applications, a simple DAG visualizer, and an HTTP endpoint to query your Kafka Streams stores (plus a web UI to view results).

Cruise writes about Terra, their platform built on the Apache Beam SDK for data pipelines. Terra supplements Beam's features by adding permissions management, job submission (including pulling python/C++ dependencies), lineage, and more. The post has some sample code that shows how it all fits together.

Airtunnel is a new open source project that provides blueprints for building Apache Airflow DAGs. The project is built around several design principles: consistency (e.g. in naming of data sets, scripts, and workflows), declarative first (Airtunnel uses YAML to declare data assets), and metadata driven. Airtunnel, which includes custom operators, metadata extensions to collect data asset lineage, and more, is available on GitHub.

Bloom filters are ubiquitous in distributed data stores because they can eliminate certain expensive operations. This post dives into what a bloom filter provides and how it works, and includes a basic implementation in Python.
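Along the same lines, here's a compact toy bloom filter in Python (not the post's code): k hash positions per item set bits in a bit array, and membership tests can return false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python int works as a simple bit array

    def _positions(self, item):
        # Derive k positions by hashing the item with k different prefixes.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False means definitely absent; True means only *possibly* present.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("user-123")
print(bf.might_contain("user-123"))  # True
print(bf.might_contain("user-999"))  # almost certainly False
```

The "definitely absent" answer is what lets a data store skip a disk read or, in the GDPR case elsewhere in this issue, skip whole partitions.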


Curated by Datadog


Off the Ground w/ Apache Airflow + Ordinary People w/ Ability for Extraordinary (Santa Monica) - Thursday, December 12


Apache Kafka Committer & Co-Founder Jun Rao on Why Kafka Needs No Keeper (Minneapolis) - Monday, December 9


HCSC Big Data Hadoop Meetup (Chicago) - Wednesday, December 11


Running Apache Airflow at Kabbage (Atlanta) - Tuesday, December 10

North Carolina

Building a Stream Processing Architecture with Apache Kafka (Charlotte) - Wednesday, December 11

New York

An Introduction to Kafka Streams and KSQL (Webster) - Tuesday, December 10


December Apache Spark Meetup (Cambridge) - Tuesday, December 10


Data Science and Engineering Club (Dublin) - Wednesday, December 11


Apache Kafka: Metamorphosis (Lisboa) - Thursday, December 12


From Hadoop to NoSQL & Graph to Translytical (Middelharnis) - Tuesday, December 10


The Learnings of Karate Kid Applied to Apache Kafka (Berlin) - Wednesday, December 11

Apache NiFi + Hacking Around the IoTree (Frankfurt) - Wednesday, December 11

Kubernetes with Kafka Flavor (Berlin) - Thursday, December 12


Using PySpark with Google Colab + Spark 3.0 Preview (Milano) - Wednesday, December 11

It's a Streamer World! A Journey Through Processing Flows of Data (Milano) - Wednesday, December 11


First Warsaw Apache Airflow Workshop (Warsaw) - Friday, December 13


Building Consciousness on Real Time Events: ksqlDB Recipes (Chennai) - Wednesday, December 11


Data Eng Weekly #330

A lot of breadth in this week's coverage of data engineering topics—from Delta to event processing at Spotify to the design goals of TileDB to understanding delivery guarantees in Apache Kafka. Also, a few articles on distributed systems—Yelp's autoscaling service, techniques for building reliable systems, and Facebook’s global routing infrastructure.

A look at how to use Delta's version tracking to "time travel" (read the contents of a table as it was at some previous point in time) and to produce audit details to track changes to a table. The post embeds several IPython notebooks with sample Scala code.

Yelp writes about their newly open-sourced autoscaler for Kubernetes, Clusterman. Compared to the Kubernetes Autoscaler, there are some interesting features like support for spot instances and the ability to simulate production workloads.

With the complexity that comes with a distributed system, it can be overwhelming to start the process of identifying and classifying failure modes. This post provides a great high level blueprint for how to approach this problem—including bucketing types of failure through Failure Modes and Effects Analysis and calculating a Risk Priority Number for each item in your analysis. Lots of great stuff (and spreadsheets to help!) for understanding how to build a reliable system.
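The Risk Priority Number arithmetic itself is simple: severity, occurrence, and detection are each scored on a 1-10 scale, and their product ranks which failure modes to tackle first. A sketch with made-up scores:

```python
# Failure Modes and Effects Analysis: each failure mode gets three 1-10
# scores; RPN = severity * occurrence * detection. Scores here are invented.
failure_modes = [
    {"mode": "Kafka broker disk full",    "severity": 9, "occurrence": 3, "detection": 2},
    {"mode": "Consumer lag spike",        "severity": 5, "occurrence": 6, "detection": 3},
    {"mode": "Schema change breaks job",  "severity": 7, "occurrence": 4, "detection": 6},
]

for fm in failure_modes:
    fm["rpn"] = fm["severity"] * fm["occurrence"] * fm["detection"]

# Highest RPN first: these are the failure modes to mitigate first.
ranked = sorted(failure_modes, key=lambda fm: fm["rpn"], reverse=True)
for fm in ranked:
    print(fm["rpn"], fm["mode"])
```

Note how a hard-to-detect failure (detection 6) can outrank a more severe but easily caught one.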

Spotify shares lessons learned from operating a large-scale event delivery system, which powers some key business functions like tracking track plays to calculate royalties. They describe some of the principles (e.g. segregating data by event type and choosing liveness over lateness), how they think about Service Level Objectives for event delivery, and how they are exposing their event platform to other internal teams.

Coupang, an e-commerce company from South Korea, writes about the evolution of their data platform over the past several years. They've been through several phases: storing all data in a relational database; using Hadoop, Hive, and an MPP database; and migrating to the cloud. The post dives into some of the other important features of their pipeline, like how they track data quality, detect data abnormalities, and make data discoverable to users.

TileDB is a new storage engine for multi-dimensional data like that commonly used for machine learning and genetics analysis. It supports both sparse and dense arrays, and it can use blob storage (such as S3) as the storage backend. The post describes the motivation for a new storage system, and how they've optimized the implementation for efficiency and cross language support.

Criteo writes about how they enforce data quality for the 450 PB that they have in their Hadoop clusters. Since Hadoop is very flexible/lenient in the data it supports, they run over 7,000 data quality checks per day using statistics from Hive tables and custom queries.

The morning paper covers Taiji, Facebook's system for routing and managing global traffic. Taiji takes advantage of knowledge about a user and their social connections to efficiently route traffic from the edge to a data center. By better taking advantage of caches, changes in routing materially reduce load on the database systems.

An in-depth look at how the various Apache Kafka producer and consumer configurations can be used to avoid duplicate data in your pipeline. This post describes the various types of delivery guarantees, how they map to Kafka settings, and how to put them all together. The post has lots of great diagrams, and there's a bonus section on other strategies for deduplicating data in a stream.
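As a reference point, the settings usually involved can be captured in plain config dicts (these are standard Java client / librdkafka property names; this is a sketch, not an exhaustive list):

```python
# Producer side: idempotent writes let the broker de-duplicate retried
# sends, and acks=all waits for every in-sync replica before confirming.
producer_config = {
    "enable.idempotence": True,
    "acks": "all",
    "max.in.flight.requests.per.connection": 5,  # safe with idempotence
}

# Consumer side: commit offsets manually *after* processing so a crash
# re-delivers rather than drops records, and only read committed
# transactional records.
consumer_config = {
    "enable.auto.commit": False,
    "isolation.level": "read_committed",
}
```

These settings get you effectively-once delivery into Kafka; end-to-end exactly-once additionally needs transactions or downstream deduplication, which is where the post's bonus section comes in.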


Curated by Datadog


Apache Kafka and Immutable Infrastructure + An Introductory Kafka Talk! (Santa Monica) - Tuesday, November 19

Learning How to Build Event Streaming Applications with Pac-Man (San Diego) - Tuesday, November 19

Bay Area Apache Flink Meetup @ Cloudera (Palo Alto) - Wednesday, November 20

Building Data Lineage, Data Orchestration, and Data Mesh (Mountain View) - Thursday, November 21


The Rise of Apache Flink and Stream Processing (Bellevue) - Wednesday, November 20


Foreign-Key Joins with Kafka Streams (Plano) - Thursday, November 21


Introduction to Apache Spark (Madison) - Monday, November 18


Data Science and Engineering Club @ Zalando (Dublin) - Thursday, November 21


Fine-Tuning Kafka: Let's Look under the Hood! (Paris) - Wednesday, November 20


Apache Beam Intro and Use Cases (Antwerpen) - Friday, November 22


Kafka Is More ACID Than Your Database (Kongens Lyngby) - Tuesday, November 19


Apache Beam @ + Portable Schema + BeamSQL (Zurich) - Tuesday, November 19


Apache Kafka as a Database (Brno-stred) - Wednesday, November 20


Event-Driven Microservices with CQRS Using Axon: Excitingly Boring (Singapore) - Wednesday, November 20


Melbourne Data Engineering Meetup (Docklands) - Wednesday, November 20

Data Eng Weekly #329

Some great stuff in this week’s issue—Lyft’s open source metadata and data discovery platform, how Netflix uses GraphQL to build search indexes, an open source backup tool for Apache Cassandra, coverage of the Ceph distributed file system's evolution, and several other posts about Apache Airflow, Apache Spark, CockroachDB, event-driven microservices, and more!

Lyft writes about Amundsen, their open-source tool for metadata management and data discovery. The post covers the architecture of the system—the metadata, search, and frontend services, as well as the databuilder data ingestion framework. Since open sourcing it earlier this year, they've received a number of contributions, such as one to make the datastore more extensible (supporting Apache Atlas in addition to Neo4j).

A look at how to use Docker Swarm to scale out Apache Airflow using a custom Airflow operator.

One data engineer's learnings after a year on the job. There are some good reflections on tools, automation (once a workflow reaches a certain size, a tool like Airflow is important), monitoring, and metadata/documentation housekeeping.

Netflix describes how they use GraphQL to build search indexes from data stored across multiple services. The key idea is to issue a (batched) GraphQL query that returns a full denormalized record (e.g. a show and its episodes) and store the results in Elasticsearch. They then listen for changes on Kafka and follow the relationships in the GraphQL schema to invalidate and reindex data.
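A hypothetical sketch of the flattening step in Python; the query, field names, and stand-in response below are invented for illustration, not Netflix's schema:

```python
# One GraphQL round trip pulls a show and its nested episodes; the
# denormalized response maps directly onto a single search document.
QUERY = """
query ($id: ID!) {
  show(id: $id) {
    id
    title
    episodes { id title runtimeMinutes }
  }
}
"""

def to_search_document(show):
    # Flatten the nested response into the document we'd index in ES.
    return {
        "id": show["id"],
        "title": show["title"],
        "episode_titles": [e["title"] for e in show["episodes"]],
    }

response = {  # stand-in for the GraphQL server's reply
    "id": "s1",
    "title": "Example Show",
    "episodes": [{"id": "e1", "title": "Pilot", "runtimeMinutes": 45}],
}
print(to_search_document(response))
```

When an episode changes, the same schema relationships are walked in reverse to find which show documents to re-fetch and reindex.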

Medusa is a new open source backup tool for Apache Cassandra. It stores backup data in cloud storage (e.g. Amazon S3 or Google Cloud Storage), and it creates smart incremental backups by taking advantage of the immutable nature of SSTables. There are many more details about the tool's features and how to use it in this post on The Last Pickle blog.

An in-depth look at the Apache Kafka consumer rebalance protocol. The post describes the pieces of the protocol like JoinGroup, SyncGroup, Heartbeat, and LeaveGroup. It also looks at the recent additions of static membership and incremental cooperative rebalancing. There are lots of great diagrams to illustrate the key concepts.

A look at how to detect skew in your Apache Spark jobs, and several ways to fix a job with skew (hints, randomizing the join key, writing a custom partitioner). Which solution is best/fastest depends a bit on the inputs to your job.
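The key-salting trick can be sketched in a few lines of plain Python, independent of Spark's API: the skewed side gets a random salt, the small side is replicated once per salt value, and the join key becomes (key, salt), fanning one hot key out over N tasks.

```python
import random

N = 4  # number of salts; one hot key spreads over up to N partitions

def salt_large_side(key):
    # Each record on the skewed (large) side gets a random salt.
    return (key, random.randrange(N))

def explode_small_side(key, value):
    # The small side is replicated once per salt so every salted key
    # still finds its match.
    return [((key, s), value) for s in range(N)]

hot = [salt_large_side("popular_user") for _ in range(8)]
print(sorted(set(hot)))  # the hot key now spans multiple join keys
print(explode_small_side("popular_user", {"tier": "gold"})[0])
```

The cost is N-fold replication of the small side, which is why the best fix depends on the shape of the inputs, as the post notes.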

The morning paper covers a paper on Ceph, the open-source distributed file system. Over the past few years, Ceph implemented a new storage backend that bypasses the filesystem to better take advantage of SSDs and HDDs. The post describes the motivation for the changes, some of the other options they explored (including RocksDB), and the performance improvements they've seen with the new storage backend.

Cockroach Labs writes about how they've sped up distributed transactions with parallel commits, which avoids certain round trips across the WAN. The post describes the solution, including how failure handling works, and it shows that experimentally latency is cut in half.

This post describes why you should consider a relational database and take advantage of its advanced features (such as triggers and stored procedures). The author motivates this with experience working in Jupyter notebooks, comparing against the complexity of NoSQL databases like MongoDB and Elasticsearch.

This article describes an event-driven architecture for maintaining a CRM and a realtime database. An interesting part of the post details how to implement an audit system to ensure that all microservices consume the events. The basic idea is to tag events with a unique ID; each microservice then generates an audit event with that ID as it processes the event. Alerts are generated when (after aggregating over a time window) the number of audit events for a particular ID is too low.
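A toy version of that completeness check in Python (event IDs and service names are invented):

```python
from collections import Counter

EXPECTED_CONSUMERS = 3  # microservices that must process every event

# Audit events collected over one aggregation window: (event_id, service).
window = [
    ("evt-1", "crm"), ("evt-1", "billing"), ("evt-1", "search"),
    ("evt-2", "crm"), ("evt-2", "billing"),   # "search" never saw evt-2
]

# Count audit events per original event ID; anything below the expected
# consumer count means some service failed to process that event.
counts = Counter(event_id for event_id, _ in window)
alerts = [eid for eid, n in counts.items() if n < EXPECTED_CONSUMERS]
print(alerts)  # → ['evt-2']
```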


Curated by Datadog


Data-Driven Development in Autonomous Driving + Spark Performance Tuning (Mountain View) - Tuesday, November 12

Data Engineering Meetup (San Diego) - Thursday, November 14


Mirror Maker 2.0 (Austin) - Tuesday, November 12


NOVA Data Engineering: First Meetup! (Herndon) - Thursday, November 14


Data Meetup (Sao Carlos) - Wednesday, November 13


Streaming Processing with Hazelcast Jet and Kafka (Stockholm) - Tuesday, November 12


Everything You Need to Know about Kafka Streams (A Coruna) - Thursday, November 14


Airflow @ SchoolMouv: Build, Schedule, and Monitor Pipelines at Scale (Toulouse) - Wednesday, November 13


Making Apache Spark Better with Delta Lake (Prague) - Thursday, November 14


QA in Beam + Beam Use Case + More! (Warsaw) - Thursday, November 14


Timeseries Forecasting as a Service + Run Spark and Flink Jobs on Kubernetes (Athens) - Thursday, November 14


Airflow Demystified + Big Data Demystified (Tel Aviv-Yafo) - Sunday, November 17


Kafka Beijing Meetup (Beijing) - Saturday, November 16


Kafka Is More ACID Than Your Database (Sydney) - Wednesday, November 13

K8s Meetup with Instaclustr & Google! (Pyrmont) - Wednesday, November 13

Sydney Data Engineering Meetup (Sydney) - Thursday, November 14


Data Eng Weekly #328

Short and sweet issue this week, with several new open source tools—Beekeeper for cleaning up unused data, the Mantis project for real-time operations, and pg_flame's flame graphs for analyzing postgres queries—as well as implementation articles covering Apache Airflow, Rust for Kafka, and using bloom filters to optimize GDPR data deletion.

Beekeeper is a new open-source tool from Expedia for cleaning up unreferenced data in Apache Hive tables. It listens for metastore changes (like changes to table or partition locations) and periodically deletes abandoned data in S3. The GitHub repo has instructions for trying it out using Docker.

Netflix has open sourced Mantis, its operations-focused event analysis system. Mantis aims to provide cost-effective, event-based analysis of a live system. It does so by allowing applications to publish lots of events but only incur the cost of those events when there's a subscriber to the event stream. At Netflix, they use Mantis for things like monitoring video streaming health, contextual alerting, measuring Cassandra health, and alerting on log events. The Mantis GitHub page has lots more details about the system, including an overview of its architecture and details on how to try it out.

A look at how the Devoted Health team uses Apache Airflow. They deploy on Kubernetes (using Terraform and Helm), which allows developers to get their own instance of the stack for development/testing. They've built a tool for standardized DAG development (using YAML definitions) and a dev tool for synchronizing code changes to Kubernetes. The post also describes how they write integration tests and validations, deploy, and monitor their deployment.

Another post on Airflow deployments, Lyft writes about how they've implemented fine-grained secure access to the Apache Airflow UI. While Airflow has built-in RBAC, they built a custom security manager that adds DAG-level access permissions (defined alongside the DAG). At Lyft, each team has its own RBAC role and can decide which teams have access to the DAGs they publish.

A post on the Confluent blog describes porting a non-trivial Kafka application from Clojure to Rust. The author describes the tradeoffs between the various Rust libraries for Kafka, how to extend a client to support Avro records/Schema Registry, and shares some benchmarks comparing Clojure and Rust performance/memory usage.

The Adobe Experience team writes about how they use bloom filters to speed up deletions of user data, for requests like those in GDPR. By maintaining a bloom filter for each partition of a data set, they can prune large swaths of data that definitely don't contain records for a particular user when deleting.

`pg_flame` looks like a handy tool for visualizing the output of postgres' `EXPLAIN ANALYZE`. Mousing over an entry in the flame graph reveals more details about that step of the `EXPLAIN`.


Curated by Datadog


Deploying Kafka SSL at Uber Scale + What’s the Time & Why? (Palo Alto) - Tuesday, October 29


Planning for the Varied Levels of Once-Data-Processing with Kafka (Scottsdale) - Tuesday, October 29


Everything You Want to Know about Apache Kafka, but Were Afraid to Ask (Sao Paulo) - Wednesday, October 30


Building Data Products in Production (London) - Tuesday, October 29


2 Years in Prod with a Hadoop Cluster (Villeurbanne) - Tuesday, October 29


Kafka at Sky Deutschland + Kafka on Kubernetes  (Unterfohring) - Tuesday, October 29


Kafka Connect (Torun) - Wednesday, October 30


Cloud Engineering at GoPro (Bucharest) - Tuesday, October 29


KSQL (Saint Petersburg) - Monday, October 28


Stream Processing with Apache Kafka, Flink, and Spark (Bengaluru) - Saturday, November 2


ETL, Kafka, and Elasticsearch Combined! (Sydney) - Tuesday, October 29
