Data Eng Weekly #326
Back after another week off, so we've got the best articles from the past two weeks. There are several interesting new things to check out this week: Bigslice and Bigmachine from GRAIL, a strategy for turning change data capture events into audit events on the Debezium blog, and the SLOG system, which aims to provide low latency and strict serializability for multi-region systems. Lots more good stuff, too: posts on data pipelines, a look at new features in PostgreSQL 12, and autoscaling for Apache Airflow.
GameChanger writes about how they've (mostly) automated loading data from their data pipeline into the data warehouse. The main source of friction was defining schemas in the warehouse, so they wrote a new tool that generates table definitions from the Avro schemas in the Confluent Schema Registry.
http://tech.gc.com/let-me-automate-that-for-you/
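The core idea is easy to sketch. Here's a minimal, hypothetical Python version (the names, type mapping, and target DDL dialect are all my assumptions, not GameChanger's actual tool): fetch the latest Avro schema for a subject from the Schema Registry's REST API and emit a CREATE TABLE statement.

    import json
    import urllib.request

    # Assumed mapping from Avro primitive types to warehouse column types.
    AVRO_TO_SQL = {
        "string": "VARCHAR", "int": "INTEGER", "long": "BIGINT",
        "float": "REAL", "double": "DOUBLE PRECISION", "boolean": "BOOLEAN",
    }

    def ddl_from_subject(registry_url, subject):
        # The Schema Registry REST API returns the latest schema version
        # for a subject as a JSON-encoded string under the "schema" key.
        url = "%s/subjects/%s/versions/latest" % (registry_url, subject)
        with urllib.request.urlopen(url) as resp:
            schema = json.loads(json.load(resp)["schema"])
        columns = []
        for field in schema["fields"]:
            ftype = field["type"]
            if isinstance(ftype, list):  # nullable union, e.g. ["null", "string"]
                ftype = next(t for t in ftype if t != "null")
            columns.append("%s %s" % (field["name"], AVRO_TO_SQL.get(ftype, "VARCHAR")))
        return "CREATE TABLE %s (%s);" % (schema["name"], ", ".join(columns))

    print(ddl_from_subject("http://localhost:8081", "my-topic-value"))

A real tool needs to handle nested records, logical types, and schema evolution, which is where the work (and the value) is.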
The kafkacat CLI tool can be used for quick-to-set-up (but not production-ready) replication between Kafka clusters/topics. This post describes how to invoke it and covers some of the caveats.
https://rmoff.net/2019/09/29/copying-data-between-kafka-clusters-with-kafkacat/
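The basic pattern is a consumer piped into a producer, something along these lines (broker addresses and the topic name are placeholders):

    # Read the source topic to the end (-e), keeping message keys via the
    # -K delimiter, and pipe everything into a producer on the target cluster.
    kafkacat -b source-broker:9092 -t my_topic -C -K: -e -u | \
      kafkacat -b target-broker:9092 -t my_topic -P -K:

Among the caveats: no offset tracking across runs, no preservation of partitioning or headers, and nothing restarts it if it dies, hence "not production ready."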
The Debezium blog shares the details of a fascinating technique for building an audit log from change data capture data. The general idea is to populate a secondary table, keyed on transaction id, with details of the JWT that was used to perform the transaction. Apache Kafka Streams then joins the CDC streams for those tables. The post dives into how to build this type of system in full detail, with lots of sample code covering 1) a JAX-RS interceptor that automatically populates the table based on the JWT and 2) the Kafka Streams application.
https://debezium.io/blog/2019/10/01/audit-logs-with-change-data-capture-and-stream-processing/
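To give a flavor of the stream processing side, here's a heavily simplified Java sketch of that join. Topic names and the helper are hypothetical, and the post's real implementation handles ordering between the two streams much more carefully than a plain stream-table join does.

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class AuditLogEnricher {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Transaction metadata written by the JAX-RS interceptor,
            // keyed by transaction id (hypothetical topic name).
            KTable<String, String> txContext =
                builder.table("dbserver1.mydb.transaction_context_data");

            // CDC stream for the business table, re-keyed by the
            // transaction id carried in the Debezium change event.
            KStream<String, String> changes = builder
                .<String, String>stream("dbserver1.mydb.orders")
                .selectKey((key, value) -> extractTxId(value));

            // Enrich each change event with who/when/why metadata and
            // emit the result as the audit log.
            changes.join(txContext, (change, ctx) -> change + " " + ctx)
                   .to("orders-audit-log");

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "audit-log-enricher");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            new KafkaStreams(builder.build(), props).start();
        }

        // Hypothetical helper: pull the transaction id out of the change event.
        static String extractTxId(String changeEventJson) {
            return changeEventJson; // placeholder; parse the Debezium envelope here
        }
    }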
PostgreSQL 12 was released a little over a week ago. The announcement describes some of the features (lots of performance improvements), and a second post on pgdash.io describes a new feature, generated columns. There are some interesting use cases for these, such as normalizing text data for searches.
https://www.postgresql.org/about/news/1976/
https://pgdash.io/blog/postgres-12-generated-columns.html
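For example, a minimal sketch of the search-normalization use case (note that PostgreSQL 12 supports only STORED generated columns, which are computed on write):

    CREATE TABLE users (
        name            text,
        -- Kept in sync by Postgres automatically; handy for
        -- case-insensitive lookups without lower() in every query.
        name_normalized text GENERATED ALWAYS AS (lower(name)) STORED
    );

    SELECT * FROM users WHERE name_normalized = lower('Ada Lovelace');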
GRAIL has open sourced Bigslice and Bigmachine, which enable distributed computation across large datasets using simple Go programs. Unlike other big data tools, Bigslice spins up EC2 instances at runtime to distribute your computation. It exposes a high-level programming model (e.g. Map, Join, Filter) for batch processing. The introductory blog post and the GitHub project have many more details, including how to get started (it looks quite easy!).
https://medium.com/grail-eng/bigslice-a-cluster-computing-system-for-go-7e03acd2419b
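The programming model looks roughly like this. This is a sketch based on my reading of the getting-started material; the session-setup details in particular may differ from the project's actual API, so check the README.

    package main

    import (
        "context"
        "fmt"

        "github.com/grailbio/bigslice"
        "github.com/grailbio/bigslice/sliceconfig"
    )

    // A bigslice.Func is the unit of distributed computation.
    var squares = bigslice.Func(func(xs []int) bigslice.Slice {
        slice := bigslice.Const(4, xs) // split the input into 4 shards
        return bigslice.Map(slice, func(x int) int { // runs on the workers
            return x * x
        })
    })

    func main() {
        // sliceconfig chooses local or cluster (e.g. EC2) execution
        // based on flags and configuration.
        sess := sliceconfig.Parse()
        defer sess.Shutdown()

        if _, err := sess.Run(context.Background(), squares, []int{1, 2, 3, 4}); err != nil {
            fmt.Println(err)
        }
    }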
I can't tell you how many times I've seen a syntax error because I tried to reference a table or column in the wrong part of a SQL query. This post explains when it's OK to cross-reference columns and tables defined in other parts of a SQL query, and there's a good cheat sheet that you might find useful as a reference.
https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/
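The underlying rule is the logical execution order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY), which explains, for instance, why a SELECT alias works in ORDER BY but not in WHERE or, in Postgres, in HAVING:

    SELECT customer, count(*) AS total
    FROM orders
    GROUP BY customer
    HAVING count(*) > 10   -- "HAVING total > 10" fails: the alias doesn't exist yet
    ORDER BY total DESC;   -- fine: ORDER BY runs after SELECT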
For those interested in distributed systems at global scale, this post dives into SLOG, a new system designed to offer low latency and strict serializability by taking advantage of locality in client access patterns. The post gives a good introduction to the high-level intuition and system design, and if you want more, the full VLDB paper is linked.
http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html
Facebook's Scribe is a high-throughput (2.5TB per second at peak) system for capturing log data. This post shares the high-level design of the system, covering topics like availability (e.g. buffering data to local disk in case of network issues), scalability, and multitenancy.
https://engineering.fb.com/data-infrastructure/scribe/
LinkedIn has open sourced the version of Apache Kafka that they run in production across thousands of brokers. It's based on the upstream Apache Kafka release branches, with LinkedIn's own changes applied on top. The post covers some of the improvements they've made, like better scalability by reusing UpdateMetadataRequest objects and a maintenance mode that makes it easier to cleanly take down a broker. It also describes their development process and how they work with the upstream Apache Kafka project.
https://engineering.linkedin.com/blog/2019/apache-kafka-trillion-messages
A look at how to ensure you're getting the best performance out of Postgres (things like partial indexes and increasing the shared buffer cache), as well as some advanced features you might not have known about, like full-text search, geospatial indexes, hstore for key/value data, and the JSON/XML data types.
https://dev.to/heroku/postgres-is-underrated-it-handles-more-than-you-think-4ff3
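As a small taste, a partial index covers only the rows a hot query actually touches, which keeps the index small and fast (table and column names here are illustrative):

    -- Index only unshipped orders; queries that filter on
    -- status = 'pending' can use this much smaller index.
    CREATE INDEX orders_pending_idx
        ON orders (created_at)
        WHERE status = 'pending';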
A look at using the Kubernetes HorizontalPodAutoscaler to autoscale the workers of an Apache Airflow deployment. While the post has some details that are specific to Google Cloud Composer (a managed service for Apache Airflow), if you're interested in autoscaling your Airflow workers, this looks like a good place to get started.
https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60
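For reference, a plain HorizontalPodAutoscaler manifest looks like this. This is a generic sketch, not the Composer-specific setup from the post, and the resource names are placeholders:

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: airflow-worker
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: airflow-worker   # the Airflow worker workload to scale
      minReplicas: 2
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70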
Convoy writes about how they improved the latency of data loads to their data warehouse, using Kafka Connect with Debezium to stream JSON change data from Postgres to Snowflake. The post has lots of practical details on deploying a production pipeline of this style.
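The source half of such a pipeline is a Debezium Postgres connector registered with Kafka Connect. A minimal config looks something like the following; the names are hypothetical, and the property names are those of the Debezium releases current at the time:

    {
      "name": "orders-cdc-source",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": "1",
        "database.hostname": "postgres.internal",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "<secret>",
        "database.dbname": "app",
        "database.server.name": "app",
        "table.whitelist": "public.orders"
      }
    }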
Events
Curated by Datadog ( http://www.datadog.com )
Colorado
Hadoop Rising: The Evolving Ecosystem (Boulder) - Thursday, October 17
https://www.meetup.com/Boulder-Denver-Big-Data/events/265388338/
Illinois
Full-Day Apache Cassandra and Kafka Workshop (Chicago) - Thursday, October 17
https://www.meetup.com/Chicago-SQL/events/263999278/
Ohio
Kafka & KSQL (Columbus) - Tuesday, October 15
https://www.meetup.com/MODUG-Mid-Ohio-Data-User-Group/events/265149919/
UNITED KINGDOM
Apache Beam Meetup 8: Streaming SQL in Beam + Beam Use Case by Huq Industries (London) - Wednesday, October 16
https://www.meetup.com/London-Apache-Beam-Meetup/events/263701679/
FRANCE
Apache Beam Meetup 2: Portability, Beam on Spark, and More! (Paris) - Thursday, October 17
https://www.meetup.com/Paris-Apache-Beam-Meetup/events/264545288/
GERMANY
Berlin AWS Group Meetup (Berlin) - Tuesday, October 15
https://www.meetup.com/aws-berlin/events/258598000/
Apache Kafka at Deutsche Bahn & Confluent Cloud (Frankfurt) - Wednesday, October 16
https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/264399318/
AUSTRIA
Managing Data Flows: Apache NiFi Deep Dive + Streaming Use Cases (Vienna) - Thursday, October 17
https://www.meetup.com/futureofdata-vienna/events/264352692/
POLAND
First Warsaw Airflow Meetup (Warsaw) - Thursday, October 17
https://www.meetup.com/Warsaw-Airflow-Meetup/events/264867971/
KENYA
MQTT and Apache Kafka: A Case Study of Uchumi Commercial Bank-Tanzania (Nairobi) - Saturday, October 19
https://www.meetup.com/nairobi-jvm/events/265236973/
INDIA
Open Source Technologies at Expedia (Bangalore) - Wednesday, October 16
https://www.meetup.com/Data-Surfers/events/264806373/
SINGAPORE
Apache Kafka and Microservices (Singapore) - Thursday, October 17
https://www.meetup.com/Singapore-Kafka-Meetup/events/265194260/
AUSTRALIA
Viktor Gamov and George Hall Talk Kafka, Kubernetes, Connectors, and Operator (Docklands) - Tuesday, October 15
https://www.meetup.com/KafkaMelbourne/events/264970023/
Kafka on Kubernetes: Does It Really Have to Be “The Hard Way”? (Sydney) - Thursday, October 17
https://www.meetup.com/apache-kafka-sydney/events/265104559/
FinTech Production with Kafka Streams (Melbourne) - Thursday, October 17
https://www.meetup.com/melbourne-distributed/events/263797130/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.