Data Eng Weekly #323

If you’re interested in microservices, asynchronous programming models, event sourcing, stream processing, or statically typed languages, there should be something in this week’s issue for you. Coverage ranges from big-picture thinking (about ORMs and whether they’re approaching the problem incorrectly) to practical advice on how to get the most out of a review of analytics code.

This article is quite the introduction to working with Kafka Streams from Clojure, covering the main Kafka Streams API as well as the Willa library for writing idiomatic Clojure. The post also describes transducers in Clojure (a mechanism for building generic data transformations) and has a couple of useful examples.
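To make the transducer idea concrete, here is a minimal sketch in Python rather than Clojure. The names (`mapping`, `filtering`, `compose`, `transduce`) are hypothetical helpers written for this illustration; they are not Clojure's or Willa's API. The point it tries to show is that the transformation is defined independently of how results are accumulated.

```python
from functools import reduce

def mapping(f):
    """Transducer that applies f to each item before passing it on."""
    def transducer(step):
        def new_step(acc, x):
            return step(acc, f(x))
        return new_step
    return transducer

def filtering(pred):
    """Transducer that only passes on items for which pred is true."""
    def transducer(step):
        def new_step(acc, x):
            return step(acc, x) if pred(x) else acc
        return new_step
    return transducer

def compose(*xforms):
    """Compose transducers so items flow through them left to right."""
    def composed(step):
        for xf in reversed(xforms):
            step = xf(step)
        return step
    return composed

def transduce(xform, step, init, coll):
    """Reduce coll with a reducing function wrapped by the transformation."""
    return reduce(xform(step), coll, init)

def append(acc, x):
    """Reducing function that collects items into a list."""
    acc.append(x)
    return acc

# One transformation, reused with two different accumulation strategies.
keep_even_times_ten = compose(filtering(lambda x: x % 2 == 0),
                              mapping(lambda x: x * 10))
as_list = transduce(keep_even_times_ten, append, [], range(6))
as_sum = transduce(keep_even_times_ten, lambda acc, x: acc + x, 0, range(6))
```

The same `keep_even_times_ten` value works whether the results are collected into a list or summed, which is roughly what "generic" means here.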

A good list of best practices in a microservices architecture. If you’re feeling pain with your deployments or architecture, then this post has lots of ideas for how to improve (like making sure you can upgrade a database schema without updating multiple services). The author notes that a half-baked microservices architecture can make some failure scenarios more likely than they would be in a monolith.

When asking folks about ORMs, you're bound to get strong opinions. This piece describes why the main design goals of an ORM (i.e. modeling your data in your language/framework) can end up generating bad SQL queries. The post argues for another approach—starting with SQL and using that to generate your ORM.

The morning paper writes about SLOG, a new multi-region database system. Unlike similar systems (Google Spanner and Calvin), SLOG introduces the idea of a "home" region. It has two modes of operation: either synchronously replicating within a single home region, or an HA mode that replicates across regions. By taking advantage of the locality within a region, the system can improve throughput and latency.

A collection of good code review tips applied to analytics code (i.e. mostly SQL or Python for data munging). For example, there's a section on consistently naming your data models and fields as well as one on DRYing up a DAG or a CTE.

Futures are used quite often in distributed systems on the JVM (and also in Node.js and other systems). This post provides a good basic introduction, using the example of building a parallel web crawler.
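For a rough feel of the pattern, here is a small sketch of a futures-based crawler using Python's `concurrent.futures` (not the JVM code from the post). To keep it self-contained, the "web" is a hypothetical in-memory dictionary (`FAKE_WEB`) and `fetch_links` stands in for a real HTTP fetch plus link extraction; both are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical in-memory "web": page -> outgoing links.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def crawl(seed, max_workers=4):
    """Visit every page reachable from seed, fetching pages in parallel."""
    seen = {seed}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(fetch_links, seed)}
        while pending:
            # Block until at least one fetch finishes, then schedule
            # a new future for each link we have not seen before.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                for link in future.result():
                    if link not in seen:
                        seen.add(link)
                        pending.add(pool.submit(fetch_links, link))
    return seen

visited = crawl("a")
```

Only the main thread touches `seen`, so no locking is needed; the worker threads just run `fetch_links`.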

A good list of pitfalls and anti-patterns, both for those getting started with Apache Cassandra and for those looking to expand their usage of it. For example, it's important to know your query pattern before you model your data, and Cassandra shouldn't be used as a queue. The post covers seven potential mistakes in some detail and mentions a few others to keep an eye out for, too.

Jepsen has published a new analysis of YugaByte DB. In this post, they test the upcoming support for serializable transactions, and they find a few problems (including with DEFAULT values and anti-dependency cycles). As always, the Jepsen post has a good overview of YugaByte, the consistency model, the test design, and more.

This article describes Derivative Event Sourcing. For legacy or other applications that you can't change, you can derive and publish events to an event stream for consumption by downstream applications. Change Data Capture via a database is a common mechanism for implementing this, but you could also derive using application or other logs.
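As a toy illustration of deriving events without touching the source application, the sketch below compares two snapshots of a table and emits created/updated/deleted events. Snapshot diffing is a simplification chosen for this example, not the article's method; a real setup would more likely read a database change log via CDC. The function name, the event shape, and the sample rows are all hypothetical.

```python
def derive_events(before, after):
    """Derive change events from two snapshots (primary key -> row dict)."""
    events = []
    # Rows that exist only in the newer snapshot were created.
    for key in sorted(after.keys() - before.keys()):
        events.append({"type": "created", "key": key, "data": after[key]})
    # Rows present in both snapshots but with different contents were updated.
    for key in sorted(before.keys() & after.keys()):
        if before[key] != after[key]:
            events.append({"type": "updated", "key": key, "data": after[key]})
    # Rows that exist only in the older snapshot were deleted.
    for key in sorted(before.keys() - after.keys()):
        events.append({"type": "deleted", "key": key, "data": None})
    return events

# Hypothetical snapshots of a legacy "customers" table.
snapshot_1 = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
snapshot_2 = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}
events = derive_events(snapshot_1, snapshot_2)
```

The derived events could then be published to an event stream for downstream consumers, which is the part the legacy application never needs to know about.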

Since a lot of data infrastructure code tends to be written in Python (and also because I just found this post interesting!), I'm sharing Dropbox's post on how they rolled out type checking for their Python codebase. The post explains why you might want static typing, and it dives deep into the performance improvements the team made to scale to 5 million lines of code.
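For readers who haven't used type annotations in Python, here is a small sketch of the kind of mistake a static checker is meant to catch. The functions are hypothetical and unrelated to Dropbox's code; the comment about the checker describes what a tool such as mypy would be expected to report.

```python
from typing import Optional

def parse_port(value: str) -> Optional[int]:
    """Return the port number, or None if value is not a valid port."""
    if value.isdecimal() and 0 < int(value) < 65536:
        return int(value)
    return None

def next_port(value: str) -> int:
    """Return the port after the one encoded in value."""
    port = parse_port(value)
    # Without this check, a type checker would be expected to flag
    # `port + 1`, because `port` may be None at this point.
    if port is None:
        raise ValueError(f"not a valid port: {value!r}")
    return port + 1
```

The annotations have no effect at runtime; the benefit comes from running a checker over the code before it ships.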


Curated by Datadog


Free Apache Spark One-Day Hands-On Workshop (Santa Clara) - Sunday, September 15 


Cleveland Big Data Mega Meetup (Cleveland) - Monday, September 9

South Carolina

Beyond Stateless --> Stateful K8s with Do's and Don'ts (Greenville) - Thursday, September 12


Vancouver Spark Meetup @ Galvanize (Vancouver) - Thursday, September 12


Data Meetup 2019.2 (Sao Carlos) - Wednesday, September 11


Building a Streaming ETL Solution with Rail Data (Leeds) - Wednesday, September 11

Data Platform User Group: Cosmos DB, Spark (Leeds) - Thursday, September 12


Safe Event Processing + Kafka at Norsk Tipping (Oslo) - Tuesday, September 10


GOTO Night with Erik Dornenburg & Kresten Thorup (Hamburg) - Monday, September 9


Data Natives Vienna v 7.0 (Vienna) - Thursday, September 12


Journey of Two Streaming Frameworks: Spark Streaming and Kafka Streams (Tel Aviv-Yafo) - Sunday, September 15

Women in Big Data Answer Any Question about Their Jobs (Tel Aviv-Yafo) - Sunday, September 15 

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.