Data Eng Weekly #322
Lots of content to read through this week, everything from tutorials (an introduction to Elasticsearch) to debugging stories (from the folks at GitLab) to performance improvements (a nice speedup for Presto and deployment architecture improvements for Flink) to some talks from Leslie Lamport on Paxos and TLA+. Also a couple of interesting new tools (PartiQL and dqlite) to check out!
Technical
The Alibaba blog has a look at how Flink Improvement Proposal 6 updates the deployment architecture to support standalone, YARN, and Kubernetes deployments. For YARN, the new architecture will support spinning up workers ahead of time to improve latency of jobs.
https://www.alibabacloud.com/blog/deploy-apache-flink-natively-on-yarn-or-kubernetes_595189
PartiQL is a new query language and runtime for relational and nested data (such as that stored in a Parquet file). The query language is SQL-compatible, and it's in use at Amazon to query data across several data stores (like Amazon S3, Amazon Redshift Spectrum, and more). There's an open source project on GitHub, and the Couchbase project is looking to adopt it.
https://aws.amazon.com/blogs/opensource/announcing-partiql-one-query-language-for-all-your-data/
The Presto blog has an article about a new optimization for queries over nested/array data that require unnesting. By using a dictionary encoding, they can avoid materializing the nested data. This improves performance by as much as 9x while also using less CPU.
https://prestosql.io/blog/2019/08/23/unnest-operator-performance-enhancements.html
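To make the dictionary-encoding idea concrete, here is a minimal Python sketch. It is not Presto's implementation (which is written in Java and works on its own block types); it only illustrates the general technique under simple assumptions: one replicated column `ids`, one array column `arrays`, and hypothetical function names. Instead of copying each replicated value once per array element, the unnest step emits an index vector that points back into the original column.

```python
def unnest_with_dictionary(ids, arrays):
    """Unnest `arrays`, describing the repeated `ids` column by indices only.

    Returns (indices, elements): `indices[k]` is the position in `ids`
    that output row k refers to; `elements[k]` is the unnested value.
    No value from `ids` is copied here.
    """
    indices = []
    elements = []
    for row, arr in enumerate(arrays):
        indices.extend([row] * len(arr))  # one small integer per output row
        elements.extend(arr)
    return indices, elements


def materialize(ids, indices):
    """Expand the dictionary-encoded column only when a consumer needs it."""
    return [ids[i] for i in indices]


# Hypothetical sample data: three rows, the second has an empty array.
ids = ["a", "b", "c"]
arrays = [[1, 2], [], [3]]
indices, elements = unnest_with_dictionary(ids, arrays)
```

Rows with empty arrays simply contribute no indices, and the wide or expensive values in `ids` are only touched if something downstream calls `materialize`.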
Leslie Lamport, who won the Turing Award for his work in distributed systems, has shared two lectures on distributed computing. He covers Paxos and TLA+ in the two videos, and he also shares slides/write-ups and follow-up exercises if you want to dive even deeper.
https://lamport.azurewebsites.net/tla/paxos-algorithm.html
Amazon CTO Werner Vogels writes about how they build applications at Amazon and, no surprise, distributed systems play a big role. He writes about their migration from a monolith to microservices, purpose-built databases, their operational model (focusing on serverless), and security.
https://www.allthingsdistributed.com/2019/08/modern-applications-at-aws.html
dqlite, which extends SQLite to run in a distributed, fault-tolerant setup, just hit version 1.0. It uses Raft for data replication and its own wire protocol.
https://github.com/canonical/dqlite
Who doesn't love a complex distributed systems debug story with a happy ending? GitLab writes about how they identified and resolved an issue with their SSHD fleet. They also share six lessons that they learned in the process.
https://about.gitlab.com/2019/08/27/tyranny-of-the-clock/
The Morning Paper covers an article that looks at the tradeoffs between various OLAP database services in AWS. The authors benchmark using TPC-H, and based on these tests they suggest storing data in S3 using a columnar format. This provides good data portability, enabling Athena and Redshift Spectrum, two tools that can offer substantial cost savings for sporadic queries.
https://blog.acolyer.org/2019/08/30/choosing-a-cloud-dbms/
It's always interesting when a small change has a big impact. In this case, adding three lines of code (switching from streaming inserts to low-frequency loads from Google Cloud Storage) resulted in a 95% drop in cost for loading data into BigQuery.
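The general pattern behind that kind of change, buffering rows and handing them off in infrequent batches rather than one call per row, can be sketched in a few lines of Python. This is only an illustration, not the code from the article: `BatchLoader`, `load_fn`, and `max_rows` are hypothetical names, and in a real pipeline `load_fn` would presumably write the batch to Google Cloud Storage and start a BigQuery load job.

```python
class BatchLoader:
    """Buffer rows and pass them to `load_fn` in batches (hypothetical helper)."""

    def __init__(self, load_fn, max_rows):
        self.load_fn = load_fn    # called with a list of rows per batch
        self.max_rows = max_rows  # batch size that triggers a load
        self.buffer = []

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self):
        """Hand off whatever is buffered; a no-op when the buffer is empty."""
        if self.buffer:
            self.load_fn(self.buffer)
            self.buffer = []      # start a fresh list; the old one belongs to load_fn


# Usage with a stand-in loader that just records each batch.
batches = []
loader = BatchLoader(batches.append, max_rows=3)
for i in range(7):
    loader.add({"id": i})
loader.flush()  # push the final partial batch
```

A time-based trigger (flush every N minutes) would be closer to "low-frequency loads" than the row-count trigger used here; the count keeps the sketch deterministic.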
This article is a great introduction to Elasticsearch. It has both breadth (covering some operational aspects, how to query data in Elasticsearch, and running Kibana) and depth (covering the details of how to efficiently load CSV data into Elasticsearch using the bulk insert API and refreshing indices afterwards). For those wanting to learn Elasticsearch, this is a great place to get started. And if you've used Elasticsearch before, there's a decent chance you'll learn something new.
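As a rough idea of what loading CSV through the bulk API involves, the sketch below builds a request body in the newline-delimited JSON format that the `_bulk` endpoint expects: an action line followed by the document, one pair per row, with a trailing newline. It uses only the Python standard library and is not taken from the article; the function name, index name, and sample data are assumptions, and every CSV value is kept as a string.

```python
import csv
import io
import json


def csv_to_bulk_body(csv_text, index_name):
    """Turn CSV text into an NDJSON body for Elasticsearch's _bulk endpoint."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(row))                                # document line
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical sample data and index name.
body = csv_to_bulk_body("name,age\nann,30\nbob,25\n", "people")
```

The resulting string would be sent as the body of a `POST` to `/_bulk` with the `Content-Type: application/x-ndjson` header, followed by a `POST` to `/people/_refresh` to make the new documents searchable.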
Events
Curated by Datadog ( http://www.datadog.com )
California
Data Engineering Meetup (San Diego) - Thursday, September 5
https://www.meetup.com/Data-Engineering-San-Diego/events/263325527/
Colorado
Kafka as a Platform: The Ecosystem from the Ground Up (Greenwood Village) - Tuesday, September 3
https://www.meetup.com/DOSUG1/events/ztwqsqyzmbfb/
District of Columbia
DC Data Engineering 1: Fast! Big! Distributed! - Tuesday, September 3
https://www.meetup.com/DC-Data-Engineering/events/263069053/
Stream Processing with the Spring Framework, Like You've Never Seen It Before (Washington) - Thursday, September 5
https://www.meetup.com/DC-Spring-Framework/events/263679369/
CANADA
Making Apache Spark Better with Delta Lake (Toronto) - Thursday, September 5
https://www.meetup.com/TAS-2-0-Toronto-Apache-Spark/events/264075662/
IRELAND
Apache Kafka and KSQL in Action! Let’s Build a Streaming Data Pipeline! (Dublin) - Thursday, September 5
https://www.meetup.com/Dublin-Apache-Kafka-Meetup-by-Confluent/events/263570908/
SPAIN
Self-Service Data Platforms with Spark, Kafka, and Avro (Barcelona) - Thursday, September 5
https://www.meetup.com/Spark-Barcelona/events/263484805/
FRANCE
Apache Spark Meetup @ AWS (Courbevoie) - Thursday, September 5
https://www.meetup.com/Paris-Spark-Meetup/events/264311260/
ISRAEL
Apache Kafka: Reaching the Castle (Herzliya) - Tuesday, September 3
https://www.meetup.com/Credorax-Group/events/264022450/
INDIA
"Kafka Day" with Neha Narkhede, Kafka Co-creator et al. (Bengaluru) - Sunday, September 8
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/264273358/
AUSTRALIA
A Small Introduction to Kafka and Using It as a Data Platform in Production (Melbourne) - Wednesday, September 4
https://www.meetup.com/KafkaMelbourne/events/264199735/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.