Data Eng Weekly #325
A few more posts than usual this week, since I took last week off. There are lots of great articles, including several about batch processing tech—Apache Hadoop MapReduce, Apache Hadoop YARN schedulers, and extending Apache Hive ACID tables to other processing engines. There's also a great post on testing distributed systems, two posts on Apache HBase, a post from Dream11 on their real-time alerting pipeline, and a deep dive into the Apache Kafka client rebalancing protocol. Lots to read through this week!
This post looks at testing distributed databases with Jepsen. It describes the types of tests Jepsen performs (e.g. a bank account simulation, a monotonically increasing counter) as well as common failure scenarios (e.g. failures during membership changes, overstated serializability guarantees). There's a lot of good material in here if you're interested in learning more about testing distributed systems.
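To make the counter test concrete, here's a minimal sketch (in Python, not Jepsen's Clojure) of the kind of check such a test runs over a history of reads: flag any read of a supposedly monotonically increasing counter that goes backwards. The function and history are illustrative, not taken from the post.

```python
# Jepsen-style monotonicity check: given the sequence of values a single
# client read from a supposedly monotonically increasing counter, report
# every read that went backwards.
def find_monotonicity_violations(reads):
    """Return (index, previous, current) for each read that decreased."""
    violations = []
    for i in range(1, len(reads)):
        if reads[i] < reads[i - 1]:
            violations.append((i, reads[i - 1], reads[i]))
    return violations

# A history with one "time travel" anomaly at index 3.
history = [1, 2, 5, 3, 6]
print(find_monotonicity_violations(history))  # [(3, 5, 3)]
```

Real Jepsen checkers work over concurrent, multi-client histories, but the idea is the same: record every operation, then search the history for states the claimed consistency model forbids.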
LiveRamp writes about their massive MapReduce-based data joining pipeline. They process 10 PB of data per day using 25k cores. At that scale, optimizations can make a big difference—for their map-side join, they use combiners, an indexed sorted-merge, and a virtual partitioning scheme. Apache Hadoop MapReduce is a tried-and-true framework, and it's interesting to read about an application that's pushing its performance limits.
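As a rough illustration of what a combiner buys you (a toy word count in Python, not LiveRamp's join code): pre-aggregating each mapper's output locally shrinks the number of (key, value) pairs that have to cross the network during the shuffle.

```python
from collections import Counter

def map_phase(records):
    # Emit one (word, 1) pair per token, as a plain word-count mapper would.
    return [(word, 1) for rec in records for word in rec.split()]

def combine(pairs):
    # Map-side combiner: sum counts locally so at most one pair per
    # distinct key is shuffled to the reducers.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

pairs = map_phase(["a b a", "b a"])
print(len(pairs))      # 5 pairs before combining
print(combine(pairs))  # [('a', 3), ('b', 2)] -- only 2 pairs after
```

The combiner only helps when the operation is associative and commutative (sums, counts, maxes), which is exactly the shape of pre-aggregation a map-side join pipeline can exploit.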
I've heard of lots of folks using the AWS Glue Data Catalog, but less about the other parts of the Glue service. This post provides a good intro to those other parts: the crawlers (with recommendations on which file formats work best), Glue jobs (which execute Apache Spark jobs), and the Glue development endpoint for testing. Looks like a good place to get started if you're interested in checking Glue out.
https://medium.com/@techatcore/aws-glue-lesson-learned-437d73f3e988
A tour of the various types of databases that have existed over the lifetime of modern computers, including some lesser-known systems you might not typically think of as databases—like CSV files and DNS. Probably not too surprising for this crowd—there's been a lot of innovation (document databases, graph databases, NewSQL) in the past two decades.
https://www.prisma.io/blog/comparison-of-database-models-1iz9u29nwn37
Dream11 writes about the architecture of their real-time alerting system, which is built with Apache Kafka Connect, KSQL, and Elasticsearch. They use Debezium to mirror updates from their OLTP systems, enrich the data with KSQL, send it to Elasticsearch, and use ElastAlert to trigger alerts on anomalies. The pipeline has under a minute of latency and handles ~40k events/sec.
The first part in a series on the Apache Hadoop YARN capacity scheduler, this post provides some practical recommendations for configuration. It covers the important settings, and whether they should be set on each individual queue or inherited from the parent. The post also has some good visualizations of how jobs fit into queues.
https://www.davidmcginnis.net/post/yarn-capacity-scheduler-and-node-labels-part-1
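For flavor, here's a hypothetical capacity-scheduler.xml fragment of the kind the post discusses. The queue names and percentages below are made up, but `capacity` (a queue's guaranteed share) and `maximum-capacity` (its elasticity cap) are the standard per-queue settings:

```xml
<!-- Illustrative fragment only: queue names and values are invented. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>etl,adhoc</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
  <value>30</value>
</property>
<property>
  <!-- Let ad hoc jobs borrow idle capacity, up to 50% of the cluster. -->
  <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
  <value>50</value>
</property>
```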
Qubole has extended support for Apache Hive ACID tables to Apache Spark and Presto. In this post, they describe why they chose Hive's transaction mechanism among the alternatives, the extensions they've implemented in Spark and Presto, and the changes they made so that Hive ACID tables work well with an object store like Amazon S3. The code is open source, and there are quick-start instructions in the repos.
KarelDB is a relational database that uses Apache Kafka for persistence, Apache Calcite for the SQL engine, and Apache Omid for transactions and multiversion concurrency control. It's not a streaming database, but rather a database built entirely on these open source components. It's in the early stages (only supporting a single node) but is interesting nonetheless!
https://yokota.blog/2019/09/23/building-a-relational-database-using-kafka/
Airbnb writes about the tradeoffs in building a data pipeline—namely that as you break up monolithic jobs into smaller components, you increase both the depth and the width of your job graph. In doing so, you tend to accrue more aggregate overhead (I/O, scheduler delay, etc.), which can greatly increase the end-to-end runtime of the pipeline. They formalize these ideas in the post, and they have an interesting case study in which they speed up a pipeline by more than 4x.
https://medium.com/airbnb-engineering/scaling-a-mature-data-pipeline-managing-overhead-f34835cbc866
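A back-of-the-envelope model of that tradeoff, with made-up numbers: holding total compute fixed while growing the depth of a linear chain of stages, the per-stage overhead alone can come to dominate the end-to-end time.

```python
# Toy model of the overhead Airbnb describes: a linear chain of `depth`
# stages that together do `compute_minutes` of real work, each paying a
# fixed overhead (scheduler delay, I/O for intermediate results).
# All numbers are invented for illustration.
def end_to_end_minutes(compute_minutes, depth, overhead_per_stage):
    return compute_minutes + depth * overhead_per_stage

work = 60      # 60 minutes of actual computation
overhead = 5   # 5 minutes of scheduling + I/O per stage

print(end_to_end_minutes(work, depth=1, overhead_per_stage=overhead))   # 65
print(end_to_end_minutes(work, depth=12, overhead_per_stage=overhead))  # 120
```

Splitting into twelve stages nearly doubles the wall-clock time here even though the real work is unchanged—which is why the post's focus on managing overhead, not just parallelism, matters.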
This post describes the new Incremental Cooperative Rebalancing implementation in Apache Kafka 2.3. There's a great introduction to the Kafka client rebalancing protocol, including how it piggybacks on the broker's group management API by delegating assignment to a leader client. The post then describes the challenges with the previous "stop the world" approach to rebalancing, walks through several examples of how cooperative rebalancing works, and shares some experimental results.
https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka
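The core difference can be reduced to one question: which partitions must each member give up when the assignment changes? A conceptual sketch (not the actual Kafka protocol implementation), with hypothetical member names:

```python
# Assignments are dicts of member -> set of partition ids.

def eager_revocations(old, new):
    # Eager ("stop the world") protocol: every member revokes everything
    # it owns before the new assignment is handed out.
    return {member: parts for member, parts in old.items() if parts}

def cooperative_revocations(old, new):
    # Cooperative protocol: a member revokes only the partitions that
    # actually move to someone else; the rest keep processing.
    return {m: old[m] - new.get(m, set())
            for m in old if old[m] - new.get(m, set())}

old = {"a": {0, 1, 2}, "b": {3, 4, 5}}
new = {"a": {0, 1}, "b": {3, 4}, "c": {2, 5}}  # member "c" joins

print(eager_revocations(old, new))        # everything is revoked
print(cooperative_revocations(old, new))  # only partitions 2 and 5 move
```

With the eager protocol, one member joining stalls all six partitions; cooperatively, four of them never stop.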
Cloudera writes about how they've extended Apache HBase to use Amazon S3 (or another object store) for its main storage. While the bulk of the data can live in S3, HBase still requires a file system with fsync semantics for its write-ahead logs, so they continue to use HDFS for those.
https://blog.cloudera.com/how-clouderas-hbase-can-leverage-amazons-s3/
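A hypothetical hbase-site.xml fragment showing that split (the bucket and paths are illustrative): `hbase.rootdir` points at S3 for table data, while `hbase.wal.dir` keeps the write-ahead logs on HDFS.

```xml
<!-- Illustrative fragment only: bucket names and paths are invented. -->
<property>
  <name>hbase.rootdir</name>
  <value>s3a://my-hbase-bucket/hbase</value>
</property>
<property>
  <!-- WALs stay on a file system that provides fsync semantics. -->
  <name>hbase.wal.dir</name>
  <value>hdfs://namenode:8020/hbase-wal</value>
</property>
```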
Pinterest describes improvements they've made to the PinalyticsDB time series database. Built on HBase, the system hit new scalability bottlenecks as usage grew within Pinterest. They've updated the schema design (introducing a salt in the row key), improved coprocessor performance (coalescing scan requests and aggregating results before returning them), and implemented some optimizations for large reports that rely on lots of metrics.
Events
Curated by Datadog ( http://www.datadog.com )
California
Post-Summit Meetup w/ Airbnb, Bloomberg, and Confluent (San Francisco) - Tuesday, October 1
https://www.meetup.com/KafkaBayArea/events/264562779/
Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Thursday, October 3
https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/264589317/
Data Platform Observability (Mountain View) - Thursday, October 3
https://www.meetup.com/democratize_data/events/265105830/
Illinois
Securing Apache Kafka Clusters (Chicago) - Thursday, October 3
https://www.meetup.com/cloud-collective/events/264531201/
New York
Design Patterns of Streaming Platforms (New York) - Monday, September 30
https://www.meetup.com/DataCouncil-AI-NYC-Data-Engineering-Science/events/264748638/
Intro to Predictive Data Analysis Using Apache Kafka, Spark, Zeppelin on JVM (New York) - Wednesday, October 2
https://www.meetup.com/ibmcodenyc/events/264557948/
UNITED KINGDOM
How to Manage State in Apache Kafka + Beyond REST: GraphQL (Leeds) - Wednesday, October 2
https://www.meetup.com/Leeds-JVMThing/events/264720580/
SPAIN
Is It All Perfect in Spark? (Barcelona) - Monday, September 30
https://www.meetup.com/Spark-Barcelona/events/264581714/
FRANCE
How Do Microservices Communicate with Kafka? (Levallois-Perret) - Thursday, October 3
https://www.meetup.com/Digital-Experience-Tribe/events/264918749/
Paris Kafka Meetup After Summer @ Datadog (Paris) - Thursday, October 3
https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/264701824/
NETHERLANDS
RxJs with React Native + Kafka with GraphQL (Rotterdam) - Wednesday, October 2
https://www.meetup.com/React-Rotterdam/events/264307774/
Meetup @ Cognizant: Building Cloud Data Platforms (Amsterdam) - Wednesday, October 2
https://www.meetup.com/DataCouncil-Amsterdam/events/263932065/
AUSTRALIA
Special Edition: Data Engineering on AWS (Sydney) - Thursday, October 3
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/263548665/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.
If you didn't receive this email directly, and you'd like to subscribe to weekly emails please visit https://dataengweekly.com