Data Eng Weekly #332
This week's issue includes articles on Pinterest's time series database, deploying Elasticsearch on Kubernetes, data & ML engineering at Slack, Airbnb's job queuing system, and the architecture of the Cliqz search engine. There is also a slew of posts on other topics, from optimizing Kafka consumers to JSON processing in Golang.
A look at the inner workings of Kafka consumers, with some real-world recommendations for deploying them when there's high latency in talking to the Kafka cluster and/or a large number of partitions. There are tips on important metrics to monitor, configurations, garbage collector settings, and changing the partition assignment strategy (the partition.assignment.strategy consumer setting) to fix unbalanced consumers.
https://www.jesseyates.com/2019/12/04/vertically-scaling-kafka-consumers.html
This article looks at why hash maps (unsorted) are popular for in-memory indexes whereas B-trees (sorted) are common in databases. It describes the trade-offs of the two approaches and how each one fits in-memory and database use cases.
https://www.evanjones.ca/ordered-vs-unordered-indexes.html
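To make the trade-off concrete, here is a minimal sketch that is not taken from the article. It assumes a plain dict as the unordered index and a sorted list searched with bisect as a stand-in for a B-tree; the class and method names are hypothetical. Point lookups are cheap in both, but only the ordered index can answer a range query without examining every key.

```python
import bisect


class HashIndex:
    """Unordered index: O(1) average point lookups, no key ordering."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def range(self, lo, hi):
        # No ordering to exploit: every key is examined, then the matches are sorted.
        return sorted((k, v) for k, v in self._data.items() if lo <= k <= hi)


class SortedIndex:
    """Ordered index. A sorted list stands in for a B-tree (an assumption
    made for brevity): O(log n) lookups, contiguous range scans."""

    def __init__(self):
        self._keys = []
        self._values = []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value  # overwrite an existing key
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def range(self, lo, hi):
        # Two binary searches find the slice; only matching entries are touched.
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_right(self._keys, hi)
        return list(zip(self._keys[start:end], self._values[start:end]))
```

Both classes return the same answers. The difference is the work done per query, which is roughly the reason a database that serves range scans and ordered output tends to pay the extra cost of keeping keys sorted.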
This post describes a large-scale conversion of data into JSON format in order to load it into BigQuery. To meet the naming requirements of BigQuery, they had to remap field names on every JSON document. Their tool, which is written in Golang, uses a producer/consumer job queue to parallelize processing and partition the data before writing it out. They processed data from both Kafka and S3, and the post talks a bit about how they optimized interaction with S3.
https://itnext.io/parsing-18-billion-lines-json-with-go-738be6ee5ed2
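The shape of that pipeline can be sketched in a few lines. This is not the authors' Go tool; it is a small Python illustration, and it assumes the classic BigQuery column rule (letters, digits, and underscores only, starting with a letter or underscore). The function names are hypothetical.

```python
import json
import queue
import re
import threading

_INVALID = re.compile(r"[^A-Za-z0-9_]")


def sanitize_name(name):
    """Map a field name onto the assumed rule: [A-Za-z_][A-Za-z0-9_]*."""
    cleaned = _INVALID.sub("_", name)
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def remap(value):
    """Recursively rename the keys of nested objects and arrays.
    Note: two keys that sanitize to the same name would collide; a real
    tool would need a policy for that case."""
    if isinstance(value, dict):
        return {sanitize_name(k): remap(v) for k, v in value.items()}
    if isinstance(value, list):
        return [remap(v) for v in value]
    return value


def process_lines(lines, workers=4):
    """Producer/consumer job queue: the caller's thread produces
    (index, line) jobs and worker threads consume them."""
    jobs = queue.Queue(maxsize=1000)
    results = [None] * len(lines)

    def worker():
        while True:
            item = jobs.get()
            if item is None:  # sentinel: no more work
                return
            idx, line = item
            results[idx] = json.dumps(remap(json.loads(line)), sort_keys=True)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for item in enumerate(lines):
        jobs.put(item)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Python threads will not speed up CPU-bound JSON parsing the way goroutines do, so treat this as an illustration of the queue pattern rather than a performance recipe.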
Pinterest writes about how they've extended their time series data store, Goku, to support querying of historical data. They tier data by compacting data through rebucketing and downsampling. For serving, they load data from S3 into RocksDB. The post goes into the details of the design of their RocksDB setup, cluster management functions, and the query processing framework.
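As a rough illustration of downsampling (not Goku's actual implementation, whose aggregation functions and bucket sizes are not given here), the sketch below groups points into fixed-width time buckets and keeps one average per bucket. The function name and the choice of averaging are assumptions.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Reduce (timestamp, value) points to one averaged point per bucket.

    Timestamps are integer seconds; each point is assigned to the bucket
    that starts at the largest multiple of bucket_seconds <= timestamp.
    """
    acc = defaultdict(lambda: [0.0, 0])  # bucket start -> [sum, count]
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        acc[bucket_start][0] += value
        acc[bucket_start][1] += 1
    return [(start, total / count) for start, (total, count) in sorted(acc.items())]
```

For example, downsampling per-second data with bucket_seconds=3600 leaves one point per hour, which is the kind of reduction that makes older data cheaper to keep in a colder tier.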
An overview of Kubernetes and how to deploy Elasticsearch on Kubernetes. This is a great introduction to many of the core concepts of Kubernetes (e.g. Deployment, Pod), including those that are important for running a stateful service (e.g. StatefulSet, PersistentVolumeClaim). It also shows how to configure accounts for ES using Kubernetes RBAC and how to use the Helm package manager for deploying to Kubernetes.
https://sematext.com/blog/kubernetes-elasticsearch/
Available both in audio form and as a transcript, InfoQ has a podcast with Josh Wills that covers the evolution of data engineering and machine learning at Slack. The interview covers their data pipeline, which feeds into big data systems for BI/data warehousing and ML products. The interview also covers the kinds of products they build with machine learning and some thoughts on the future of observability for ML pipelines.
https://www.infoq.com/podcasts/slack-building-resilient-data-engineering/
Airbnb has open sourced Dynein, the job queuing system that they use for offloading tasks from the main request path and performing other asynchronous operations. It uses DynamoDB as a scheduler for future jobs and SQS for queuing. The post describes how this is built in a highly scalable way.
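The split between a scheduler store and a queue can be shown with a toy, in-memory model. This is not Dynein's code or API. It assumes a heap as a stand-in for the scheduler table and a deque as a stand-in for the message queue, and every name in it is hypothetical.

```python
import heapq
from collections import deque


class DelayedJobQueue:
    """Toy model: future jobs wait in a scheduler store until they are due,
    then move to a ready queue that workers poll."""

    def __init__(self):
        self._scheduled = []   # min-heap of (run_at, seq, job): the "scheduler"
        self._ready = deque()  # FIFO of due jobs: the "queue"
        self._seq = 0          # tie-breaker so jobs are never compared directly

    def submit(self, job, run_at, now):
        if run_at <= now:
            self._ready.append(job)  # immediate jobs skip the scheduler
        else:
            heapq.heappush(self._scheduled, (run_at, self._seq, job))
            self._seq += 1

    def tick(self, now):
        """Move every job that has come due into the ready queue."""
        moved = 0
        while self._scheduled and self._scheduled[0][0] <= now:
            _, _, job = heapq.heappop(self._scheduled)
            self._ready.append(job)
            moved += 1
        return moved

    def poll(self):
        """Return the next ready job, or None if nothing is due."""
        return self._ready.popleft() if self._ready else None
```

In the real system the two halves are separate managed services, so each can be scaled on its own; the toy only shows the flow of a job from "scheduled" to "ready".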
When you're running a high-throughput system in Java, issues with garbage collection are inevitable. This post provides details on how to enable GC logs, how to interpret the output of the Concurrent Mark Sweep and G1 collectors, and some tools for visualizing the logs. This is one of the most comprehensive guides that I've seen on the topic.
https://sematext.com/blog/java-garbage-collection-logs/
Cliqz, makers of a web search engine, have a comprehensive post on their architecture. Their near real-time indexing system is built with Apache Kafka, Apache Cassandra, and RocksDB, while their batch indexing system is built on MapReduce and Spark with Luigi for managing workflows. The post also describes how they manage Kubernetes clusters, use Helm/Helmfile for package management, and leverage Tilt and K9s for local development. They also share how they optimize costs and describe their machine learning pipelines.
Events
Curated by Datadog ( http://www.datadog.com )
California
Building a Best-in-Class Data Lake on AWS and Azure (Santa Clara) - Tuesday, December 17
https://www.meetup.com/datariders/events/266951424/
North Carolina
Zero to Observability with Apache Kafka (Raleigh) - Tuesday, December 17
https://www.meetup.com/Raleigh-Apache-Kafka-Meetup-by-Confluent/events/266917829/
CANADA
Introduction to Kafka (Montreal) - Wednesday, December 18
https://www.meetup.com/montreal-jug/events/266729844/
BRAZIL
Data Engineering Meetup (Belo Horizonte) - Wednesday, December 18
https://www.meetup.com/engenharia-de-dados/events/267072117/
SERBIA
Event Deduplication in Kafka + Navigation in a 3D Environment with RL (Novi Sad) - Wednesday, December 18
https://www.meetup.com/Big-Data-Novi-Sad/events/267060354/
ISRAEL
Spark Advanced Topics (Tel Aviv-Yafo) - Thursday, December 19
https://www.meetup.com/Women-in-Big-Data-Israel/events/266728256/
RUSSIA
Real-Time Data: Streaming and Collecting Data in Real Time (Moscow) - Wednesday, December 18
https://www.meetup.com/Data-People/events/266992802/
INDIA
Designing ETL Pipelines with Structured Streaming and Delta Lake (Bengaluru) - Wednesday, December 18
https://www.meetup.com/Bangalore-Apache-Spark-Meetup/events/266970481/
Apache Kafka and Stream Processing Meetup @ Walmart (Bengaluru) - Sunday, December 22
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/266777028/
SRI LANKA
Insights into HDInsights (Colombo) - Wednesday, December 18
https://www.meetup.com/sldatacommunity/events/267042058/
SOUTH KOREA
Flink Meetup (Seoul) - Tuesday, December 17
https://www.meetup.com/Seoul-Apache-Flink-Meetup/events/266824815/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.