Data Eng Weekly #324

In this week's issue, Robinhood and Zoomcar share their data infrastructure, and we learn about WePay's distributed write-ahead log (newly open sourced). There are also great articles on database tuning, the new garbage collectors in Java 11, testing distributed systems, and more.


Squarespace writes about how they drastically improved the performance of the PostgreSQL deployment backing their TLS infrastructure (p95 response latency went from 200ms to 50ms). The post talks about the architectural changes (making better use of hot read replicas, offloading unnecessary tasks) and tuning (connection pools, better indexes) that they made.

https://engineering.squarespace.com/blog/2019/performance-tuning-postgres-within-our-tls-infrastructure

An article on the importance of testing and formal verification in distributed systems. The author also argues that functional programming and static typing can help narrow the amount of testing and make formal verification easier.

https://blog.colinbreck.com/on-eliminating-error-in-distributed-software-systems/

Cloudera shares some benchmarking of the G1GC, ZGC, and CMS Java 11 garbage collectors with HBase. They use the Yahoo Cloud Serving Benchmark to evaluate performance and improved settings for the HBase workload.

https://blog.cloudera.com/cdh6-3-hbase-g1-gc-tuning-with-jdk11/

Zoomcar's data platform ingests data from a number of sources (mobility products as well as customer apps). They write about how the platform has evolved from analytics on a MySQL replica to a full-blown data platform with data in Kafka and S3. The post covers a lot of topics, such as how they ingest data from relational databases (plus schemas) and their clickstream.

https://medium.com/@shanker.sneh/https-medium-com-shanker-sneh-data-platform-at-zoomcar-a-narrative-part-i-f2455e3e2ae5

WePay has open sourced Waltz, which is a distributed write-ahead log. They use Waltz as the primary store for transactions, and they materialize views of the data to the database for each service. Waltz has a lot of features for serializability, which are described (along with the architecture) in this post. Waltz uses ZooKeeper for cluster management, and it has separate server and storage nodes.

https://wecode.wepay.com/posts/waltz-a-distributed-write-ahead-log

Gojek shares some tips for configuring and tuning the Kafka Producer.

https://blog.gojekengineering.com/how-to-unlock-the-full-potential-of-kafka-producers-e1a6877e2167

Robinhood writes about the infrastructure powering their data lake, which processes over 10TB/day and houses over 4PB of data. They ingest data from Kafka, storing it in S3 for batch processing with Apache Spark, AWS Athena/Presto, and Redshift. Workflows are coordinated with Apache Airflow, and they use Looker for BI.

https://robinhood.engineering/data-lake-at-robinhood-3e9cdf963368

This post provides an introduction to SQL ROLLUP, which provides a mechanism to compute aggregates at multiple levels of a grouping (when your GROUP BY has multiple columns). It also looks at the CUBE keyword, which provides a mechanism for computing even more levels of aggregates.

https://dev.to/griffinator76/rollup-like-a-boss-3dkl
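
To make the difference between the two keywords concrete, here is a small Python sketch that emulates the grouping sets ROLLUP and CUBE produce. It is not taken from the linked post: the sales rows, column names, and helper functions are all hypothetical, and a real database does this work inside the query engine.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales rows: (region, product, amount).
ROWS = [
    ("east", "apples", 10),
    ("east", "pears", 5),
    ("west", "apples", 7),
]
COLUMNS = ("region", "product")


def rollup_sets(columns):
    """Grouping sets for ROLLUP: every prefix of the column list, longest first."""
    return [columns[:n] for n in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Grouping sets for CUBE: every subset of the column list."""
    return [
        subset
        for n in range(len(columns), -1, -1)
        for subset in combinations(columns, n)
    ]


def aggregate(rows, columns, grouping_sets):
    """Sum the amount once per grouping set; None marks a rolled-up column."""
    totals = defaultdict(int)
    for grouping in grouping_sets:
        for row in rows:
            record = dict(zip(columns, row))
            key = tuple(record[c] if c in grouping else None for c in columns)
            totals[key] += row[-1]
    return dict(totals)


rollup = aggregate(ROWS, COLUMNS, rollup_sets(COLUMNS))
cube = aggregate(ROWS, COLUMNS, cube_sets(COLUMNS))
```

With two grouping columns, the rollup holds per-(region, product) rows, per-region subtotals, and a grand total. The cube additionally holds per-product subtotals such as `(None, "apples")`.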

`fselect` is a handy CLI tool that presents a SQL-like query language for searching the file system (similar to *nix `find`). It also supports outputting results as JSON in addition to delimited text.

https://cli.fan/posts/fselect/


Events

Curated by Datadog

California

Apache Kafka Data Durability (San Jose) - Thursday, September 19 

https://www.meetup.com/BayLISA/events/264201177/

Washington

Evolving Data Technologies: Survey of Data Technology Trends (Bellevue) - Wednesday, September 18 

https://www.meetup.com/Big-Data-Bellevue-BDB/events/262650432/

Colorado

Real-Time Analytics with Apache Druid at Fullcontact (Denver) - Tuesday, September 17 

https://www.meetup.com/Denver-Apache-Druid-Meetup-by-Imply/events/264007236/

Wisconsin

Event-Driven Architecture with Kafka and Containers (Brookfield) - Wednesday, September 18 

https://www.meetup.com/Techmke/events/264255810/

District of Columbia

Survey of Real-Time Data Platforms: Cassandra, Spark, Akka, Kafka, Etc. (Washington) - Thursday, September 19 

https://www.meetup.com/Cassandra-DataStax-DC/events/264344711/

CANADA

Apache Kafka for the Enterprise: IBM Event Streams (Toronto) - Monday, September 16 

https://www.meetup.com/IBM-Cloud-Toronto/events/264134916/

IRELAND

Spark Meetup: Real-Time Edition (Dublin) - Thursday, September 19 

https://www.meetup.com/Dublin-Spark-Meetup/events/264285037/

UNITED KINGDOM

Parquet Optimisations + Building Spark Data Pipelines (London) - Wednesday, September 18 

https://www.meetup.com/Spark-London/events/264184808/

Building Stream Processing Applications with Apache Kafka Using KSQL (Manchester) - Thursday, September 19 

https://www.meetup.com/Manchester-Kafka/events/263968143/

FINLAND

Helsinki Apache Kafka Meetup (Helsinki) - Monday, September 16 

https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/263025896/

SPAIN

Kafka Streams and the Tide of Data (Barcelona) - Wednesday, September 18 

https://www.meetup.com/Meetup-de-Big-Data-de-datahack-en-Barcelona/events/264490794/

FRANCE

FinistDevs: Apache Flink & WebAssembly (Le Relecq-Kerhuon) - Thursday, September 19 

https://www.meetup.com/FinistDevs/events/264598116/

GERMANY

Building Stream Processing Applications with Apache Kafka Using KSQL (Dortmund) - Tuesday, September 17 

https://www.meetup.com/Dortmund-Kafka/events/263555199/

On Track with Apache Kafka: Building a Streaming ETL Solution with Rail Data (Eschborn) - Wednesday, September 18 

https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/263895041/

Orchestrate Kafka on Kubernetes + Kafka @DATEV (Nuremberg) - Wednesday, September 18 

https://www.meetup.com/Nurnberg-Kafka/events/264136545/

ITALY

Building Stream Processing Applications with Apache Kafka Using KSQL (Rome) - Monday, September 16 

https://www.meetup.com/Roma-Kafka-meetup-group/events/263968301/

POLAND

Dissolving the Problem: Kafka Is More ACID Than Your Database (Gdansk) - Monday, September 16 

https://www.meetup.com/Gdansk-Kafka/events/264429138/

BULGARIA

Riding Endless Streams with Kafka (Sofia) - Thursday, September 19 

https://www.meetup.com/Leanplum-Tech-Talks-Sofia/events/264368211/

AUSTRALIA

Melbourne Data Engineering Meetup (Melbourne) - Thursday, September 19 

https://www.meetup.com/Melbourne-Data-Engineering-Meetup/events/262799936/

Sydney Data Engineering Meetup (Surry Hills) - Thursday, September 19 

https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/262330474/

NEW ZEALAND

NZ Data Engineering Meetup #1: Snowflake and Your Data Lake (Auckland) - Thursday, September 19 

https://www.meetup.com/New-Zealand-Data-Engineering-Meetup/events/263637937/


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly #323

As long as you’re interested in microservices, asynchronous programming models, event sourcing, stream processing, or statically typed languages, then there should be something in this week’s issue for you. Lots of coverage from some big thinking (about ORMs and if they’re approaching the problem incorrectly) to practical advice about how to get the most out of a review for analytics code.


This article is quite the introduction to working with Kafka Streams from Clojure, covering the main Kafka Streams API as well as the Willa library for writing idiomatic Clojure. The post also describes transducers in Clojure (a mechanism for building generic data transformations) and has a couple of useful examples.

https://clojure-conundrums.co.uk/posts/kafka-streams-the-clojure-way/

A good list of best practices in a microservices architecture. If you’re feeling pain with your deployments or architecture, then this post has lots of ideas for how to improve (like making sure you can upgrade a database schema without updating multiple services). The author notes that a half-baked microservices architecture can lead to failure scenarios that are more likely than they would be in a monolith.

https://blog.softwaremill.com/are-you-sure-youre-using-microservices-f8d4e912d014

When asking folks about ORMs, you're bound to get strong opinions. This piece describes why the main design goals of an ORM (i.e. modeling your data in your language/framework) can end up generating bad SQL queries. The post argues for another approach—starting with SQL and using that to generate your ORM.

https://abe-winter.github.io/2019/09/03/orms-backwards.html

The morning paper writes about SLOG, a new multi-region database system. Different from similar systems (Google Spanner and Calvin), SLOG introduces the idea of a "home" region. It has two modes of operation—either synchronously replicating within a single home region or a HA mode that replicates across regions. By taking advantage of the locality within a region, the system can improve throughput and latency.

https://blog.acolyer.org/2019/09/04/slog/

A collection of good code review tips applied to analytics code (i.e. mostly SQL or Python for data munging). For example, there's a section on consistently naming your data models and fields as well as one on DRYing up a DAG or a CTE.

https://blog.getdbt.com/how-to-review-an-analytics-pull-request/

Futures are used quite often in distributed systems on the JVM (and also in node.js and other systems). This post provides a good basic introduction using an example of building a parallel web crawler.

http://www.lihaoyi.com/post/EasyParallelProgrammingwithScalaFutures.html
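
The linked post uses Scala Futures. As a rough analogue, the sketch below shows the same idea with Python's `concurrent.futures`. The in-memory link graph and the `fetch_links` stand-in are hypothetical; a real crawler would issue HTTP requests there.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": page -> links found on that page.
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def fetch_links(page):
    """Stand-in for an HTTP fetch plus link extraction."""
    return LINKS.get(page, [])


def crawl(start, max_depth, workers=4):
    """Breadth-first crawl that fetches each level's pages in parallel."""
    seen = {start}
    frontier = [start]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth):
            # One future per page; the fetches run concurrently.
            futures = [pool.submit(fetch_links, page) for page in frontier]
            next_frontier = []
            for future in futures:
                # result() blocks until that particular fetch has finished.
                for link in future.result():
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
            if not frontier:
                break
    return seen
```

Each level of the crawl waits for all of its futures before moving on, which keeps the bookkeeping (`seen`, `frontier`) on a single thread.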

A good list of pitfalls and anti-patterns both for those getting started with and looking at expanding usage of Apache Cassandra. For example—it's important to know your query pattern before you model your data, and Cassandra shouldn't be used as a queue. The post covers seven potential mistakes in some detail and mentions a few others to keep an eye out for, too.

https://blog.softwaremill.com/7-mistakes-when-using-apache-cassandra-51d2cf6df519

Jepsen has published a new analysis of YugaByte DB. In this post, they test the upcoming support for serializable transactions, and they find a few problems (including with DEFAULT values and anti-dependency cycles). As always, the Jepsen post has a good overview of YugaByte, the consistency model, the test design, and more.

https://jepsen.io/analyses/yugabyte-db-1.3.1

This article describes Derivative Event Sourcing. For legacy or other applications that you can't change, you can derive and publish events to an event stream for consumption by downstream applications. Change Data Capture via a database is a common mechanism for implementing this, but you could also derive using application or other logs.

https://www.confluent.io/blog/event-sourcing-vs-derivative-event-sourcing-explained
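
As a toy illustration of deriving events from a system you can't change, the sketch below diffs two snapshots of a table and emits one event per change. Snapshot diffing is swapped in here for simplicity and is an assumption, not the article's method; production setups typically read the database's change log instead. The event shape is hypothetical.

```python
def derive_events(before, after):
    """Compare two snapshots (primary key -> row dict) and emit change events."""
    events = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old is None:
            events.append({"type": "created", "key": key, "data": new})
        elif new is None:
            events.append({"type": "deleted", "key": key, "data": old})
        elif old != new:
            events.append({"type": "updated", "key": key, "data": new})
    return events


# Hypothetical snapshots of an "orders" table taken at two points in time.
snapshot_1 = {1: {"status": "new"}, 2: {"status": "paid"}}
snapshot_2 = {1: {"status": "shipped"}, 3: {"status": "new"}}
events = derive_events(snapshot_1, snapshot_2)
```

The derived events could then be published to a stream for downstream consumers, without touching the legacy application.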

Since a lot of data infrastructure code tends to be written in Python (and also because I just found this post interesting!), I'm sharing Dropbox's post on how they rolled out type checking for their Python codebase. The post motivates why you might want static typing, and it dives deep into the performance improvements the team made to scale to 4 million lines of code.

https://blogs.dropbox.com/tech/2019/09/our-journey-to-type-checking-4-million-lines-of-python/


Events

Curated by Datadog ( http://www.datadog.com )

California

Free Apache Spark One-Day Hands-On Workshop (Santa Clara) - Sunday, September 15

https://www.meetup.com/Big-Data-AI-101/events/264122535/ 

Ohio

Cleveland Big Data Mega Meetup (Cleveland) - Monday, September 9

https://www.meetup.com/Cleveland-Hadoop/events/261736663/

South Carolina

Beyond Stateless --> Stateful K8s with Do's and Don'ts (Greenville) - Thursday, September 12

https://www.meetup.com/Upstate-DevOps/events/263999330/

CANADA

Vancouver Spark Meetup @ Galvanize (Vancouver) - Thursday, September 12

https://www.meetup.com/Vancouver-Spark/events/263865177/

BRAZIL

Data Meetup 2019.2 (Sao Carlos) - Wednesday, September 11

https://www.meetup.com/opensanca/events/263966499/

UNITED KINGDOM

Building a Streaming ETL Solution with Rail Data (Leeds) - Wednesday, September 11

https://www.meetup.com/Leeds-Kafka/events/263368579/

Data Platform User Group: Cosmos DB, Spark (Leeds) - Thursday, September 12

https://www.meetup.com/dataleeds/events/255664614/

NORWAY

Safe Event Processing + Kafka at Norsk Tipping (Oslo) - Tuesday, September 10

https://www.meetup.com/Data-Engineering-Oslo/events/264216455/

GERMANY

GOTO Night with Erik Dornenburg & Kresten Thorup (Hamburg) - Monday, September 9

https://www.meetup.com/ThoughtWorks-Hamburg/events/264215436/

AUSTRIA

Data Natives Vienna v 7.0 (Vienna) - Thursday, September 12

https://www.meetup.com/Big-Data-Vienna/events/262995302/

ISRAEL

Journey of Two Streaming Frameworks: Spark Streaming and Kafka Streams (Tel Aviv-Yafo) - Sunday, September 15

https://www.meetup.com/Big-things-are-happening-here/events/264611212/

Women in Big Data Answer Any Question about Their Jobs (Tel Aviv-Yafo) - Sunday, September 15

https://www.meetup.com/Women-in-Big-Data-Israel/events/263937625/ 



Data Eng Weekly #322

Lots of content to read through this week: everything from tutorials (introduction to Elasticsearch) to debugging stories (from the folks at GitLab) to performance improvements (a nice speedup for Presto and deployment architecture improvements for Flink) to some talks from Leslie Lamport on Paxos and TLA+. Also a couple of interesting new tools (PartiQL and dqlite) to check out!


Technical

The Alibaba blog has a look at how Flink Improvement Proposal 6 updates the deployment architecture to support standalone, YARN, and Kubernetes deployments. For YARN, the new architecture will support spinning up workers ahead of time to improve latency of jobs.

https://www.alibabacloud.com/blog/deploy-apache-flink-natively-on-yarn-or-kubernetes_595189

PartiQL is a new query language and runtime for relational and nested data (such as that stored in a Parquet file). The query language is SQL-compatible, and it's in use at Amazon to query data across several data stores (like Amazon S3, Amazon Redshift Spectrum, and more). There's an open source project on GitHub, and the Couchbase project is looking to adopt it.

https://aws.amazon.com/blogs/opensource/announcing-partiql-one-query-language-for-all-your-data/

The Presto blog has an article about a new optimization for queries over nested/array data that require unnesting. By using a dictionary encoding, they can avoid materializing the nested data. This improves performance by as much as 9x while also using less CPU.

https://prestosql.io/blog/2019/08/23/unnest-operator-performance-enhancements.html
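
One way to picture the dictionary idea: when an array column is unnested, the other columns in the row have to be repeated once per element. The sketch below contrasts copying those values with emitting positions into a shared dictionary. This is a simplified illustration with hypothetical names, not Presto's actual operator code.

```python
def unnest_copy(rows):
    """Naive unnest: copy the replicated column once per array element."""
    return [(name, item) for name, items in rows for item in items]


def unnest_dictionary(rows):
    """Dictionary-style unnest: store each replicated value once, emit positions."""
    dictionary = [name for name, _ in rows]
    ids = []
    elements = []
    for position, (_, items) in enumerate(rows):
        for item in items:
            ids.append(position)  # small integer instead of a copied value
            elements.append(item)
    return dictionary, ids, elements


def decode(dictionary, ids, elements):
    """Materialize the rows only when a consumer actually needs them."""
    return [(dictionary[i], e) for i, e in zip(ids, elements)]


# Hypothetical input: (user, list of tags).
ROWS = [("alice", ["x", "y", "z"]), ("bob", []), ("carol", ["w"])]
dictionary, ids, elements = unnest_dictionary(ROWS)
```

The encoded form holds each replicated value once, however long the arrays are, and decoding it gives the same rows as the naive version.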

Leslie Lamport, who won the Turing Award for his work in distributed systems, has shared two lectures on distributed computing. He covers Paxos and TLA+ in the two videos, and he also shares slides/write-ups and follow-up exercises if you want to dive even deeper.

https://lamport.azurewebsites.net/tla/paxos-algorithm.html

Amazon CTO Werner Vogels writes about how they build applications at Amazon and, no surprise, distributed systems play a big role. He writes about their migration from a monolith to microservices, purpose-built databases, their operational model (focusing on serverless), and security.

https://www.allthingsdistributed.com/2019/08/modern-applications-at-aws.html

dqlite, which extends sqlite to run in a distributed, fault tolerant setup, just hit version 1.0. It uses Raft for data replication and its own wire protocol.

https://github.com/canonical/dqlite

Who doesn't love a complex distributed systems debug story with a happy ending? GitLab writes about how they identified and resolved an issue with their SSHD fleet. They also share six lessons that they learned in the process.

https://about.gitlab.com/2019/08/27/tyranny-of-the-clock/

The morning paper covers an article that looks at the tradeoffs between various OLAP database services in AWS. The authors benchmark using TPC-H, and based on these tests they suggest storing data in S3 using a columnar format. This provides good data portability, enabling Athena and Redshift Spectrum, two tools that can offer quite a cost savings for sporadic queries.

https://blog.acolyer.org/2019/08/30/choosing-a-cloud-dbms/

It's always interesting when a small change has a big impact. In this case, adding three lines of code (switching from streaming inserts to low-frequency loads from Google Cloud Storage) resulted in a 95% drop in cost for loading data into BigQuery.

https://medium.com/@harshithdwivedi/trimming-down-over-95-of-your-bigquery-costs-using-file-loads-d08dd3d8b2fd

This article is a great introduction to Elasticsearch. It has both breadth (covering some operational aspects, how to query data in elastic, and running Kibana) and depth (covering the details of how to efficiently load CSV data into elastic using the bulk insert API and refreshing indices afterwards). For those wanting to learn elastic, this is a great place to get started. And if you've used elastic before, there's a decent chance you'll learn something new.

https://towardsdatascience.com/from-scratch-to-search-setup-elasticsearch-under-4-minutes-load-a-csv-with-python-and-read-e31405d244f1


Events

Curated by Datadog ( http://www.datadog.com )

California

Data Engineering Meetup (San Diego) - Thursday, September 5 

https://www.meetup.com/Data-Engineering-San-Diego/events/263325527/

Colorado

Kafka as a Platform: The Ecosystem from the Ground Up (Greenwood Village) - Tuesday, September 3

https://www.meetup.com/DOSUG1/events/ztwqsqyzmbfb/

District of Columbia

DC Data Engineering 1: Fast! Big! Distributed! - Tuesday, September 3

https://www.meetup.com/DC-Data-Engineering/events/263069053/

Stream Processing with the Spring Framework, Like You've Never Seen It Before (Washington) - Thursday, September 5

https://www.meetup.com/DC-Spring-Framework/events/263679369/ 

CANADA

Making Apache Spark Better with Delta Lake (Toronto) - Thursday, September 5

https://www.meetup.com/TAS-2-0-Toronto-Apache-Spark/events/264075662/

IRELAND

Apache Kafka and KSQL in Action! Let’s Build a Streaming Data Pipeline! (Dublin) - Thursday, September 5

https://www.meetup.com/Dublin-Apache-Kafka-Meetup-by-Confluent/events/263570908/

SPAIN

Self-Service Data Platforms with Spark, Kafka, and Avro (Barcelona) - Thursday, September 5

https://www.meetup.com/Spark-Barcelona/events/263484805/

FRANCE

Apache Spark Meetup @ AWS (Courbevoie) - Thursday, September 5

https://www.meetup.com/Paris-Spark-Meetup/events/264311260/

ISRAEL

Apache Kafka: Reaching the Castle (Herzliya) - Tuesday, September 3

https://www.meetup.com/Credorax-Group/events/264022450/

INDIA

"Kafka Day" with Neha Narkhede, Kafka Co-creator et al. (Bengaluru) - Sunday, September 8

https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/264273358/

AUSTRALIA

A Small Introduction to Kafka and Using It as a Data Platform in Production (Melbourne) - Wednesday, September 4

https://www.meetup.com/KafkaMelbourne/events/264199735/



Data Eng Weekly #321

25 August 2019

If you're reading this, then the transition to the new email service is a success! You should expect the same great content with a slightly new look.

As for this week's issue, there's coverage of some tools (DBT, Debezium for MySQL), distributed systems architecture (the Databricks Delta Lake transaction log, Timescale's distributed time series DB, and an overview of consistency and isolation levels), and posts on RocksDB and Twitter's new open source telemetry agent. There should be something good for everyone!


Technical

This tutorial describes how to enable the MySQL binary log for streaming change data capture (i.e. producing a record for each insert, update, delete into a MySQL table) to Apache Kafka using Debezium.

https://blog.clairvoyantsoft.com/mysql-cdc-with-apache-kafka-and-debezium-3d45c00762e4

Klarna writes about Diftong, their tool for validating changes to workflows by comparing data sets produced before and after the changes. Diftong is a general purpose tool that works with any two data sets sharing a schema by applying some general purpose techniques: deduplicating data and calculating row-/column-level statistics. There's a full paper on the tool and how it's used at Klarna if you want to learn even more than this post covers.

https://engineering.klarna.com/how-we-built-a-tool-for-validating-big-data-workflows-170c196a4493

The Delta Lake framework maintains a transaction log alongside a data set to provide atomicity. The transaction log is stored as JSON with each file representing a commit. This post dives into the details of this implementation, including optimizations using checkpoints, optimistic concurrency control, and handling of conflicts.

https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
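
The mechanics are easier to see in miniature. The sketch below keeps one JSON file per commit, uses exclusive file creation to detect a competing writer, and replays the log to find the current set of data files. It is a simplified toy under stated assumptions, not Delta Lake's implementation: the function names are hypothetical, checkpoints are left out, and every collision is retried without checking whether the two commits logically conflict.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Write one commit file; fail if another writer already claimed the version."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" creates the file exclusively and raises FileExistsError if it
    # already exists, which is how this sketch detects a competing commit.
    with open(path, "x") as handle:
        for action in actions:
            handle.write(json.dumps(action) + "\n")


def latest_version(log_dir):
    """Highest committed version, or -1 for an empty log."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)


def commit_with_retry(log_dir, actions, attempts=3):
    """Optimistic concurrency: assume no conflict, retry at the next version."""
    for _ in range(attempts):
        version = latest_version(log_dir) + 1
        try:
            commit(log_dir, version, actions)
            return version
        except FileExistsError:
            continue  # another writer won this version; re-read and try again
    raise RuntimeError("could not commit")


def current_files(log_dir):
    """Replay commits in order to get the set of data files in the table."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as handle:
            for line in handle:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


with tempfile.TemporaryDirectory() as log_dir:
    first = commit_with_retry(log_dir, [{"add": {"path": "part-0.parquet"}}])
    second = commit_with_retry(
        log_dir,
        [{"remove": {"path": "part-0.parquet"}}, {"add": {"path": "part-1.parquet"}}],
    )
    table_files = current_files(log_dir)
    try:
        commit(log_dir, first, [])  # version 0 is already taken
        conflict_detected = False
    except FileExistsError:
        conflict_detected = True
```

Because readers only ever see whole commit files, a commit that swaps one data file for another is either fully visible or not visible at all.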

Timescale writes about their distributed time series database built on PostgreSQL, which is under development and in private beta. The post describes how they use "chunking" rather than "sharding" to distribute data across nodes in the cluster, presents the high level architecture (access and data nodes), and demonstrates how the system handles inserts and queries.

https://blog.timescale.com/blog/building-a-distributed-time-series-database-on-postgresql/

The Dremio blog covers a relatively new feature of Apache Arrow, the Flight data transfer protocol. Flight is built on gRPC and aims to saturate networks while also having low CPU overhead by using the Arrow in-memory data representation (i.e. no deserialization or serialization).

https://www.dremio.com/understanding-apache-arrow-flight/

Rezolus is a new open source telemetry agent from Twitter. It's written in Rust, and it implements sophisticated data collection and sampling in order to detect short (e.g. <10 seconds) anomalous events.

https://blog.twitter.com/engineering/en_us/topics/open-source/2019/introducing-rezolus.html

Rockset writes about how they improved performance of bulk loading data into RocksDB. They parallelize writes, optimize compactions, and more. Overall, they get a 20x speedup over the original approach.

https://www.rockset.com/blog/optimizing-bulk-load-in-rocksdb/

The Telegraph Engineering blog writes about dbt, the Data Building Tool, for building data transformations. It describes the major functionality of dbt, like its UI for viewing data sources and models, its framework for writing templated queries, and its functionality for building data check tests (e.g. guaranteeing unique values or that a column is never null in a data set).

https://medium.com/the-telegraph-engineering/dbt-a-new-way-to-handle-data-transformation-at-the-telegraph-868ce3964eb4

This post provides an overview of both isolation levels and consistency levels, and it describes why in many cases you need guarantees for both. In many cases, we use terms that describe both an isolation and consistency level, so it's all a bit complicated. But definitely worth understanding these terms if you're working with data systems that are throwing these terms around!

https://fauna.com/blog/demystifying-database-systems-part-4-isolation-levels-vs-consistency-levels


Events

Curated by Datadog ( http://www.datadog.com )

California

Apache Heron Hands-On (Sunnyvale) - Monday, August 26

https://www.meetup.com/Apache-Heron-Bay-Area/events/nglzdryzlbzb/

Apache Druid and YuniKorn: Universal Resource Scheduler for Both K8s and Yarn (San Francisco) - Wednesday, August 28

https://www.meetup.com/SF-Big-Analytics/events/263274680/

Arizona

Kafka Streams on Kubernetes with RocksDB & Ktables Plus Avro! (Scottsdale) - Tuesday, August 27

https://www.meetup.com/Kafka-Phoenix/events/263627979/

Virginia 

Kicking Your Database to the Curb (Reston) - Tuesday, August 27

https://www.meetup.com/Apache-Kafka-DC/events/263755964/

GERMANY

Apache Spark on Kubernetes (Frankfurt) - Tuesday, August 27

https://www.meetup.com/HSUG-Rhein-Main/events/263799697/

SWITZERLAND

Apache Kafka Meetup at Swiss Re (Zurich) - Monday, August 26

https://www.meetup.com/Zurich-Apache-Kafka-Meetup-by-Confluent/events/262386007/

FINLAND

Helsinki Apache Kafka Meetup (Helsinki) - Tuesday, August 27

https://www.meetup.com/Helsinki-Apache-Kafka-Meetup/events/263025817/

AUSTRALIA

Data Engineering Melbourne Meetup (Melbourne) - Thursday, August 29

https://www.meetup.com/Data-Engineering-Melbourne/events/258551305/



Data Eng Weekly #318

21 July 2019

Several great posts this week, including some advice on operating distributed systems and a look at deduplicating data at scale at MixPanel. There are also several new tools in this week's issue: Ballista is a query engine written in Rust, LinkedIn has open sourced Brooklin, and OctoSQL is a new tool for querying across data sources.

Technical

Presto writes about implementing dynamic filtering in which they evaluate predicates when parsing files. By reading much less data, they see significant speed ups for joins that require shuffling data over the network.

https://prestosql.io/blog/2019/06/30/dynamic-filtering.html
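
A minimal sketch of the idea, assuming a simple hash join: the join keys seen on the small side are collected first and then pushed into the scan of the large side, so rows that can never match are dropped while reading. The function names and row shapes are hypothetical, and this is not Presto's code.

```python
def scan(rows, predicate=None):
    """Stand-in for a table scan; the predicate is applied while reading rows."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row


def join_with_dynamic_filter(fact_rows, dim_rows, key):
    """Hash join that filters the probe side using keys from the build side."""
    # Build side: read the small dimension table first.
    build = {}
    for row in scan(dim_rows):
        build.setdefault(row[key], []).append(row)
    # The dynamic filter is the set of join keys seen on the build side.
    keys = set(build)
    scanned = 0
    joined = []
    # Probe side: rows whose key cannot match never leave the scan.
    for fact in scan(fact_rows, lambda r: r[key] in keys):
        scanned += 1
        for dim in build[fact[key]]:
            joined.append({**fact, **dim})
    return joined, scanned


facts = [
    {"user_id": 1, "amount": 5},
    {"user_id": 2, "amount": 7},
    {"user_id": 3, "amount": 9},
    {"user_id": 1, "amount": 2},
]
dims = [{"user_id": 1, "country": "NO"}]
joined, scanned = join_with_dynamic_filter(facts, dims, "user_id")
```

In a distributed engine the payoff comes from the rows that never get shuffled over the network; here `scanned` simply counts how many probe rows survived the filter.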

WePay has a second post about Change Data Capture (CDC) for Apache Cassandra, in which they cover the architecture of the CDC agent, loading data from Kafka to BigQuery, and the transformations they do within BigQuery (they use views which are periodically materialized). Pretty interesting details about how they bootstrap the agent with a full table scan and the details of the commit log/queue processors. There's also some good advice on their philosophy for shipping a solution quickly by keeping scope small.

https://wecode.wepay.com/posts/streaming-cassandra-at-wepay-part-2

An engineer on the payment systems team at Uber shares several best practices for working with distributed systems. The post covers a lot of ground, including monitoring, alerting, outages, postmortems, failover drills, and much more. The post is full of advice, such as valuing investment in keeping a system reliable as it scales up.

https://blog.pragmaticengineer.com/operating-a-high-scale-distributed-system/

The BRIN, or Block Range INdex, is a new (introduced in version 9.5) PostgreSQL index type. BRIN stores summary information about blocks of pages, which can be used to filter out pages at query time. It takes up less space than a B-Tree index, and has some other interesting characteristics. The Percona blog goes through this and other details.

https://www.percona.com/blog/2019/07/16/brin-index-for-postgresql-dont-forget-the-benefits/
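
To show why block summaries help, here is a toy min/max index in Python. It is only a sketch of the idea: `block_size` is a hypothetical stand-in for a range of pages, and PostgreSQL's real BRIN implementation works on heap pages with pluggable summary operators.

```python
def build_brin(values, block_size):
    """Summarise each block as (start_index, min, max)."""
    summary = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summary.append((start, min(block), max(block)))
    return summary


def search(values, summary, block_size, low, high):
    """Return matching values and how many blocks had to be read."""
    matches, blocks_read = [], 0
    for start, block_min, block_max in summary:
        if block_max < low or block_min > high:
            continue  # the summary proves this block cannot contain a match
        blocks_read += 1
        block = values[start:start + block_size]
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read


# Values that correlate with physical order (e.g. timestamps) suit BRIN best.
values = list(range(100))
summary = build_brin(values, block_size=10)
matches, blocks_read = search(values, summary, 10, low=25, high=34)
```

The index stores three numbers per block rather than one entry per row, which is where the space savings over a B-Tree come from. The trade-off is that poorly ordered data makes every block's range wide, so few blocks can be skipped.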

Ballista is a POC-stage data querying engine written in Rust with Kubernetes as the distributed orchestration layer. Applications can be written in SQL or using a dataframe-like API. Ballista is written by the same author as DataFusion, the query engine that is part of the Rust implementation of Apache Arrow.

https://andygrove.io/2019/07/announcing-ballista/

LinkedIn has open sourced Brooklin, their tool for streaming data between systems. Brooklin is a multi-tenant, dedicated service with dynamic provisioning/management (via REST endpoints). At LinkedIn, it's used to stream data between streaming systems (e.g. Kafka and Azure EventHubs as well as Kafka to Kafka), change data capture, and more. It currently supports MySQL, Cosmos DB, and Azure SQL as data sources as well as Kinesis, Cosmos DB, and Couchbase as destinations. The intro post has several more details.

https://engineering.linkedin.com/blog/2019/brooklin-open-source

Pinterest writes about their deployment of Presto. It's a rather large deployment—they have 14k CPU cores with 100 TBs of RAM across their fleet, and they service over 400,000 queries per month. They've built a ton of interesting analytics and monitoring about their clusters, such as measuring cluster stability by analyzing the number of started but not completed queries. They also talk about challenges—like deeply nested data structures, straggling workers, and more.

https://medium.com/@Pinterest_Engineering/presto-at-pinterest-a8bda7515e52

When they don't receive a 200 OK response, MixPanel's client applications retry the submission of analytics data, which can result in duplicates. At their scale, maintaining a global index of all event ids is expensive—so MixPanel has come up with a neat solution where they partition data and use in-memory indexes per partition to dedup.

https://engineering.mixpanel.com/2019/07/18/petabyte-scale-data-deduplication/
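
A rough sketch of the partitioned approach: rather than one global index of event ids, each partition keeps its own small in-memory set. Partitioning by a hash of the event id is an assumption made for this illustration, and the class and field names are hypothetical. MixPanel's actual partitioning scheme is described in the post.

```python
import zlib


class PartitionedDeduper:
    """Keeps one in-memory set of seen event ids per partition."""

    def __init__(self, num_partitions):
        self.indexes = [set() for _ in range(num_partitions)]

    def partition_for(self, event_id):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return zlib.crc32(event_id.encode("utf-8")) % len(self.indexes)

    def accept(self, event):
        """Return True the first time an event id is seen, False for a retry."""
        index = self.indexes[self.partition_for(event["id"])]
        if event["id"] in index:
            return False
        index.add(event["id"])
        return True


deduper = PartitionedDeduper(num_partitions=4)
# The third event simulates a client retry after a missing 200 OK.
incoming = [{"id": "e1"}, {"id": "e2"}, {"id": "e1"}]
kept = [event for event in incoming if deduper.accept(event)]
```

A retried event always hashes to the same partition as the original, so each partition can dedup independently and no single index has to hold every id.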

OctoSQL is a tool for querying multiple data sources—whether remote DBs or local files. It seems like it could be useful for joining a one-off dataset with some data in your main SQL data. OctoSQL is written in golang, and configuration of datasources is via a simple yaml configuration format.

https://github.com/cube2222/octosql


Events

Curated by Datadog ( http://www.datadog.com )

California

All about Streaming: Monitor Kafka Like a Pro + Apache Pulsar (San Francisco) - Wednesday, July 24

https://www.meetup.com/SF-Big-Analytics/events/261460187/

Bay Area K8s Meetup (San Jose) - Thursday, July 25

https://www.meetup.com/Bay-Area-Kubernetes-Meetup/events/263114269/

Ohio

Kafka Streams and GraphQL (Columbus) - Monday, July 22

https://www.meetup.com/MODUG-Mid-Ohio-Data-User-Group/events/261678077/

Georgia

Streaming with Kafka + Introduction to KSQL (Atlanta) - Thursday, July 25

https://www.meetup.com/Kafka-ATL/events/262260089/

North Carolina

Everything You Wanted to Know about Apache Kafka but Were Too Afraid to Ask! (Durham) - Monday, July 22

https://www.meetup.com/Raleigh-Apache-Kafka-Meetup-by-Confluent/events/262741628/

Learn about Apache Kafka (Charlotte) - Tuesday, July 23

https://www.meetup.com/charlottedevs/events/262195948/

Maryland

Kafka Top Ten Configurations (Columbia) - Thursday, July 25

https://www.meetup.com/devops-columbia/events/261247340/

CANADA

Challenges That Everyone Struggles with While Productionizing Apache Spark (Toronto) - Wednesday, July 24

https://www.meetup.com/scalator/events/262986601/

SWEDEN

Apache Beam Meetup 3: Beam Portability + Visual Pipeline Development + ML w/ Beam (Stockholm) - Thursday, July 25

https://www.meetup.com/Apache-Beam-Stockholm/events/262263586/

PORTUGAL

Streaming Your Events with Kafka Streams & KSQL (Lisboa) - Wednesday, July 24

https://www.meetup.com/Lisbon-Apache-Kafka-Meetup-by-Confluent/events/263181213/

ISRAEL

Data Fundamentals for Growing Companies: Culture & Architecture (Tel Aviv-Yafo) - Monday, July 22

https://www.meetup.com/WeWork-Technology-TLV/events/262941950/

AUSTRALIA

Melbourne Data Engineering Meetup (Docklands) - Wednesday, July 24

https://www.meetup.com/Melbourne-Data-Engineering-Meetup/events/262658320/


