Data Eng Weekly #330

A lot of breadth in this week's coverage of data engineering topics—from Delta to event processing at Spotify to the design goals of TileDB to understanding delivery guarantees in Apache Kafka. Also, a few articles on distributed systems—Yelp's autoscaling service, techniques for building reliable systems, and Facebook’s global routing infrastructure.


A look at how to use Delta's version tracking to "time travel" (read the contents of a table as it was at some previous point in time) and to produce audit details to track changes to a table. The post embeds several IPython notebooks with sample Scala code.

https://medium.com/@aravinthR/delta-time-travel-for-data-lake-part-2-b23879a4bc6d
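
The embedded notebooks use Scala, but the same reads are available from PySpark. Here's a minimal, hedged sketch (the table path, version number, and timestamp are placeholders, and a Spark session with the Delta Lake package available is assumed):

```python
# Hedged sketch of Delta time travel from PySpark; path, version, and timestamp
# are placeholders, and the Delta Lake package is assumed to be on the classpath.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

# Read the table as it was at a specific version...
v5 = spark.read.format("delta").option("versionAsOf", 5).load("/data/events")

# ...or as it was at a point in time.
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2019-11-01 00:00:00")
            .load("/data/events"))

# The commit history backs the audit use case: one row per write, with the
# operation, timestamp, and operation metrics.
DeltaTable.forPath(spark, "/data/events").history().show(truncate=False)
```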

Yelp writes about their newly open-sourced autoscaler for Kubernetes, Clusterman. Compared to the Kubernetes Autoscaler, it has some interesting features, like support for spot instances and the ability to simulate production workloads.

https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html

With the complexity that comes with a distributed system, it can be overwhelming to start the process of identifying and classifying failure modes. This post provides a great high-level blueprint for how to approach this problem—including bucketing types of failure through Failure Modes and Effects Analysis (FMEA) and calculating a Risk Priority Number for each item in your analysis. Lots of great stuff (and spreadsheets to help!) for understanding how to build a reliable system.

https://medium.com/@adrianco/failure-modes-and-continuous-resilience-6553078caad5
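
For a sense of the arithmetic: in standard FMEA scoring each failure mode gets a 1-10 rating for severity, occurrence, and detection, and the Risk Priority Number is their product. A toy example (the failure modes and scores below are invented):

```python
# Toy FMEA worksheet: RPN = severity x occurrence x detection, each scored 1-10
# (a higher detection score means the failure is harder to detect).
failure_modes = [
    {"mode": "Kafka broker loses a disk", "severity": 7, "occurrence": 4, "detection": 3},
    {"mode": "Upstream schema change breaks parser", "severity": 8, "occurrence": 6, "detection": 5},
    {"mode": "Batch job silently drops late events", "severity": 6, "occurrence": 5, "detection": 8},
]

for fm in failure_modes:
    fm["rpn"] = fm["severity"] * fm["occurrence"] * fm["detection"]

# Work the highest-risk items first.
for fm in sorted(failure_modes, key=lambda f: f["rpn"], reverse=True):
    print(f'{fm["rpn"]:>4}  {fm["mode"]}')
```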

Spotify shares lessons learned from operating a large-scale event delivery system, which powers key business functions like counting track plays to calculate royalties. They describe some of their principles (e.g. segregating data by event type and choosing liveness over lateness), how they think about Service Level Objectives for event delivery, and how they are exposing their event platform to other internal teams.

https://labs.spotify.com/2019/11/12/spotifys-event-delivery-life-in-the-cloud/

Coupang, an e-commerce company from South Korea, writes about the evolution of their data platform over the past several years. They've been through several phases: storing all data in a relational database, using Hadoop, Hive, and an MPP database, and migrating to the cloud. The post dives into some of the other important features of their pipeline, like how they track data quality, detect data abnormalities, and make data discoverable to users.

https://medium.com/coupang-tech/evolving-the-coupang-data-platform-308e305a9c45

TileDB is a new storage engine for multi-dimensional data, like that commonly used for machine learning and genetics analysis. It supports both sparse and dense arrays, and it can use blob storage (such as S3) as the storage backend. The post describes the motivation for a new storage system and how they've optimized the implementation for efficiency and cross-language support.

https://medium.com/tiledb/tiledb-a-database-for-data-scientists-ddf4ca122176
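
For a feel for the programming model, here's a hedged TileDB-Py sketch of a small dense array (the array name, dimensions, and attribute are illustrative; check the TileDB docs for the current API, and an S3 URI could stand in for the local path):

```python
# A minimal TileDB-Py sketch: a tiny 2D dense array backed by local disk.
import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(0, 3), tile=2, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(0, 3), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(domain=dom, sparse=False,
                            attrs=[tiledb.Attr(name="value", dtype=np.float64)])

tiledb.DenseArray.create("example_array", schema)

with tiledb.DenseArray("example_array", mode="w") as arr:
    arr[:] = np.random.rand(4, 4)        # write the whole array

with tiledb.DenseArray("example_array", mode="r") as arr:
    print(arr[0:2, 0:2]["value"])         # read back a slice
```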

Criteo writes about how they enforce data quality for the 450 PB of data in their Hadoop clusters. Since Hadoop is very flexible/lenient about the data it accepts, they run over 7,000 data quality checks per day using statistics from Hive tables and custom queries.

https://medium.com/criteo-labs/big-data-quality-at-criteo-66c6bd0d42d8
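
Criteo doesn't publish their checks, but a very common pattern in this style is comparing a partition's row count against its recent history. A rough sketch, where `run_hive_query` is a hypothetical helper that runs a query against Hive and returns rows:

```python
# Not Criteo's implementation -- just one common style of check: compare today's
# partition row count against the trailing 7-day average and fail on a big gap.
def check_row_count(table, partition, run_hive_query, tolerance=0.5):
    today = run_hive_query(
        f"SELECT COUNT(*) FROM {table} WHERE dt = '{partition}'")[0][0]
    history = run_hive_query(
        f"SELECT COUNT(*) FROM {table} WHERE dt < '{partition}' "
        f"GROUP BY dt ORDER BY dt DESC LIMIT 7")
    baseline = sum(r[0] for r in history) / max(len(history), 1)
    if baseline and abs(today - baseline) / baseline > tolerance:
        raise ValueError(
            f"{table}/{partition}: row count {today} deviates more than "
            f"{tolerance:.0%} from trailing average {baseline:.0f}")
```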

The morning paper covers Taiji, Facebook's system for routing and managing global traffic. Taiji takes advantage of knowledge about a user and their social connections to efficiently route traffic from the edge to a data center. Because this routing makes better use of caches, the changes materially reduce load on the backing database systems.

https://blog.acolyer.org/2019/11/15/facebook-taiji/

An in-depth look at how the various Apache Kafka configurations for producers and consumers can be used to avoid duplicate data in your pipeline. This post describes the various types of delivery guarantees, how they map to Kafka settings, and how to put them all together. There are lots of great diagrams, and there's a bonus section on other strategies for deduplicating data in a stream.

https://medium.com/@andy.bryant/processing-guarantees-in-kafka-12dd2e30be0e
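
As a hedged illustration (using the confluent-kafka Python client; the broker, topic, and group ID are placeholders), the settings the post walks through map roughly onto a client configuration like this:

```python
# Idempotent producer plus a consumer that reads committed data and commits
# offsets only after processing -- the building blocks of stronger guarantees.
from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,   # implies acks=all and safe retries
})
producer.produce("orders", key=b"order-1", value=b'{"total": 42}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processor",
    "enable.auto.commit": False,          # commit only after processing succeeds
    "isolation.level": "read_committed",  # skip records from aborted transactions
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())                     # stand-in for real processing
    consumer.commit(message=msg, asynchronous=False)  # at-least-once commit point
consumer.close()
```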


Events

Curated by Datadog ( http://www.datadog.com )

California

Apache Kafka and Immutable Infrastructure + An Introductory Kafka Talk! (Santa Monica) - Tuesday, November 19

https://www.meetup.com/Los-Angeles-Apache-Kafka-Meetup/events/265903100/

Learning How to Build-Event Streaming Applications with Pac-Man (San Diego) - Tuesday, November 19

https://www.meetup.com/San-Diego-Java-Users-Group/events/266167833/

Bay Area Apache Flink Meetup @ Cloudera (Palo Alto) - Wednesday, November 20

https://www.meetup.com/Bay-Area-Apache-Flink-Meetup/events/266226960/

Building Data Lineage, Data Orchestration, and Data Mesh (Mountain View) - Thursday, November 21

https://www.meetup.com/SF-Big-Analytics/events/265730920/

Washington

The Rise of Apache Flink and Stream Processing (Bellevue) - Wednesday, November 20

https://www.meetup.com/Big-Data-Bellevue-BDB/events/262650434/

Texas

Foreign-Key Joins with Kafka Streams (Plano) - Thursday, November 21

https://www.meetup.com/Dallas-Kafka/events/265880083/

Wisconsin

Introduction to Apache Spark (Madison) - Monday, November 18

https://www.meetup.com/Women-in-Big-Data-Wisconsin-Chapter/events/265857508/

IRELAND

Data Science and Engineering Club @ Zalando (Dublin) - Thursday, November 21

https://www.meetup.com/Data-Science-and-Engineering-Club/events/265937165/

FRANCE

Fine-Tuning Kafka: Let's Look under the Hood! (Paris) - Wednesday, November 20

https://www.meetup.com/PerfUG/events/254607871/

BELGIUM

Apache Beam Intro and Use Cases (Antwerpen) - Friday, November 22

https://www.meetup.com/Belgium-Apache-Beam-Meetup/events/264933301/

DENMARK

Kafka Is More ACID Than Your Database (Kongens Lyngby) - Tuesday, November 19

https://www.meetup.com/Copenhagen-Javagruppen-Meetup/events/266441150/

SWITZERLAND

Apache Beam @ Ricardo.ch + Portable Schema + BeamSQL (Zurich) - Tuesday, November 19

https://www.meetup.com/Zurich-Apache-Beam-Meetup/events/265529665/

CZECH REPUBLIC

Apache Kafka as a Database (Brno-stred) - Wednesday, November 20

https://www.meetup.com/Brno-Java-Meetup/events/265476406/

SINGAPORE

Event-Driven Microservices with CQRS Using Axon: Excitingly Boring (Singapore) - Wednesday, November 20

https://www.meetup.com/singajug/events/266381397/

AUSTRALIA

Melbourne Data Engineering Meetup (Docklands) - Wednesday, November 20

https://www.meetup.com/Melbourne-Data-Engineering-Meetup/events/265892291/

Data Eng Weekly #329

Some great stuff in this week’s issue—Lyft’s open source metadata and data discovery platform, how Netflix uses GraphQL to build search indexes, an open source backup tool for Apache Cassandra, coverage of the Ceph distributed file system's evolution, and several other posts about Apache Airflow, Apache Spark, CockroachDB, event-driven microservices, and more!


Lyft writes about Amundsen, their open-source tool for metadata management and data discovery. The post covers the architecture of the system—the metadata, search, and frontend services as well as the databuilder (data ingestion framework). Since open sourcing it earlier this year, they've had a number of contributions, such as one to make the datastore more extensible (supporting Apache Atlas in addition to Neo4j).

https://eng.lyft.com/open-sourcing-amundsen-a-data-discovery-and-metadata-platform-2282bb436234

A look at how to use Docker Swarm to scale out Apache Airflow using a custom Airflow operator.

https://medium.com/analytics-vidhya/orchestrating-airflow-tasks-with-docker-swarm-69b5fb2723a7
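
This isn't the post's code, but a minimal sketch of what such a custom operator can look like using the docker SDK (the image, command, and naming are placeholders; recent Airflow releases also ship a built-in DockerSwarmOperator for this use case):

```python
# Hedged sketch of a custom Airflow operator that launches a one-off service on
# a Docker Swarm; Airflow 1.10-style imports are assumed.
import docker
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class SwarmServiceOperator(BaseOperator):
    @apply_defaults
    def __init__(self, image, command, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.image = image
        self.command = command

    def execute(self, context):
        client = docker.from_env()
        # Run once and don't restart; the swarm scheduler picks the node.
        service = client.services.create(
            image=self.image,
            command=self.command,
            name=f"airflow-{self.task_id}",
            restart_policy=docker.types.RestartPolicy(condition="none"),
        )
        self.log.info("Started swarm service %s", service.id)
        return service.id
```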

One data engineer's learnings after a year on the job. There are some good reflections on tools, automation (once a workflow reaches a certain size, a tool like Airflow is important), monitoring, and metadata/documentation housekeeping.

https://medium.com/@lohmengxin/one-year-as-a-data-engineer-d713b511f0d4

Netflix describes how they use GraphQL to build indexes of data stored across multiple services. The key idea is to issue a (batched) GraphQL query that returns a full denormalized record (e.g. a show and its episodes) and store the results in Elasticsearch. They then listen to change events on Kafka and follow the relationships in the GraphQL schema to invalidate/reindex data.

https://medium.com/netflix-techblog/graphql-search-indexing-334c92e0d8d5
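
A rough sketch of the pattern, not Netflix's code (the GraphQL endpoint, query shape, and index name are all hypothetical):

```python
# Fetch a denormalized document with a GraphQL query, then index it.
import requests
from elasticsearch import Elasticsearch

QUERY = """
query ShowsForIndexing($ids: [ID!]!) {
  shows(ids: $ids) {
    id
    title
    episodes { id title runtimeMinutes }
  }
}
"""

resp = requests.post(
    "https://graphql.internal.example.com/",
    json={"query": QUERY, "variables": {"ids": ["show-123"]}},
)
resp.raise_for_status()

es = Elasticsearch(["http://localhost:9200"])
for show in resp.json()["data"]["shows"]:
    # One denormalized document per show, episodes nested inside.
    es.index(index="shows", id=show["id"], body=show)
```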

Medusa is a new open source backup tool for Apache Cassandra. It stores backup data in cloud storage (e.g. Amazon S3 or Google Cloud Storage), and it creates smart incremental backups by taking advantage of the immutable nature of SSTables. There are many more details about the tool's features and how to use it in this post on The Last Pickle blog.

https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html

An in-depth look at the Apache Kafka consumer rebalance protocol. The post describes the pieces of the protocol like JoinGroup, SyncGroup, Heartbeat, and LeaveGroup. It also looks at the recent additions of static membership and incremental cooperative rebalancing. There are lots of great diagrams to illustrate the key concepts.

https://medium.com/streamthoughts/apache-kafka-rebalance-protocol-or-the-magic-behind-your-streams-applications-e94baf68e4f2

A look at how to detect skew in your Apache Spark jobs, and several ways to fix a job with skew (hints, randomizing the join key, writing a custom partitioner). Which solution is best/fastest depends a bit on the inputs to your job.

https://www.davidmcginnis.net/post/spark-job-optimization-dealing-with-data-skew
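
As a concrete illustration of the "randomize the join key" (salting) approach, here's a hedged PySpark sketch; it assumes a SparkSession `spark` and DataFrames `large` (skewed) and `small` joined on a column `key`, and the salt factor is arbitrary:

```python
# Salting: spread a hot join key across several artificial partitions.
from pyspark.sql import functions as F

SALT = 16  # arbitrary; tune to the skew you observe

# Add a random salt to every row on the skewed side...
large_salted = large.withColumn("salt", (F.rand() * SALT).cast("int"))

# ...and replicate the small side once per salt value so every (key, salt)
# combination still finds its match.
salts = spark.range(SALT).withColumnRenamed("id", "salt")
small_salted = small.crossJoin(salts)

joined = large_salted.join(small_salted, on=["key", "salt"]).drop("salt")
```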

The morning paper covers a paper on Ceph, the open-source distributed file system. Over the past few years, Ceph implemented a new store that bypasses a filesystemto better take advantage of SSD and HDD disks. The post describes the motivation of the changes, some of the other options they explored (including rocksdb), and the performance improvements they see with the new storage backend.

https://blog.acolyer.org/2019/11/06/ceph-evolution/

Cockroach Labs writes about how they've sped up distributed transactions with parallel commits, which avoids certain round trips across the WAN. The post describes the solution, including how failure handling works, and shows experimentally that latency is cut in half.

https://www.cockroachlabs.com/blog/parallel-commits

This post makes the case for using a relational database and taking advantage of its advanced features (such as triggers and stored procedures). The author motivates the argument with experience from working in Jupyter notebooks and by comparing the complexity of NoSQL databases like MongoDB and Elasticsearch.

https://tselai.com/modern-data-practice-and-the-sql-tradition.html

This article describes an event-driven architecture for maintaining a CRM and a realtime database. An interesting part of the post details how to implement an audit system to ensure that all microservices consume the events. The basic idea is to tag events with a unique ID, and each microservice generates an audit event with that ID as it processes the event. Alerts are generated when (after aggregating over a time window) the number of audit events for a particular ID is too low.

https://medium.com/@vladimir.elvov/why-event-driven-is-businesss-best-friend-982749561024
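
Not the article's implementation, but a toy sketch of that completeness check with hypothetical service names:

```python
# Every service emits an audit record (event_id, service) as it processes an
# event; after a window, alert on event IDs audited by fewer services than expected.
from collections import defaultdict

EXPECTED_SERVICES = {"crm-updater", "realtime-db-writer", "notifier"}

def find_incomplete(audit_events):
    """audit_events: (event_id, service) pairs observed within one time window."""
    seen = defaultdict(set)
    for event_id, service in audit_events:
        seen[event_id].add(service)
    # Event IDs that were not audited by every expected service.
    return {eid: EXPECTED_SERVICES - services
            for eid, services in seen.items()
            if services < EXPECTED_SERVICES}

window = [("evt-1", "crm-updater"), ("evt-1", "realtime-db-writer"),
          ("evt-1", "notifier"), ("evt-2", "crm-updater")]
print(find_incomplete(window))   # evt-2 is missing audits from two services
```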


Events

Curated by Datadog ( http://www.datadog.com )

California

Data-Driven Development in Autonomous Driving + Spark Performance Tuning (Mountain View) - Tuesday, November 12

https://www.meetup.com/SF-Big-Analytics/events/265162755/

Data Engineering Meetup (San Diego) - Thursday, November 14

https://www.meetup.com/Data-Engineering-San-Diego/events/266078459/

Texas

Mirror Maker 2.0 (Austin) - Tuesday, November 12

https://www.meetup.com/Austin-Apache-Kafka-Meetup-Stream-Data-Platform/events/266214573/

Virginia

NOVA Data Engineering: First Meetup! (Herndon) - Thursday, November 14

https://www.meetup.com/NOVA-Data-Engineering/events/265820975/

BRAZIL

Data Meetup (Sao Carlos) - Wednesday, November 13

https://www.meetup.com/opensanca/events/265658143/

SWEDEN

Streaming Processing with Hazelcast Jet and Kafka (Stockholm) - Tuesday, November 12

https://www.meetup.com/Knock-Data-Stockholm/events/266041648/

SPAIN

Everything You Need to Know about Kafka Streams (A Coruna) - Thursday, November 14

https://www.meetup.com/CorunaJUG/events/266199913/

FRANCE

Airflow @ SchoolMouv: Build, Schedule, and Monitor Pipelines at Scale (Toulouse) - Wednesday, November 13

https://www.meetup.com/Toulouse-Data-Engineering/events/264518964/

CZECH REPUBLIC

Making Apache Spark Better with Delta Lake (Prague) - Thursday, November 14

https://www.meetup.com/CS-HUG/events/266200104/

POLAND

QA in Beam + Beam Use Case + More! (Warsaw) - Thursday, November 14

https://www.meetup.com/Warsaw-Apache-Beam-Meetup/events/265212437/

GREECE

Timeseries Forecasting as a Service + Run Spark and Flink Jobs on Kubernetes (Athens) - Thursday, November 14

https://www.meetup.com/Athens-Big-Data/events/265957761/

ISRAEL

Airflow Demystified + Big Data Demystified (Tel Aviv-Yafo) - Sunday, November 17

https://www.meetup.com/Big-Data-Demystified/events/263781635/

CHINA

Kafka Beijing Meetup (Beijing) - Saturday, November 16

https://www.meetup.com/Beijing-Kafka-Meetup/events/266223400/

AUSTRALIA

Kafka Is More ACID Than Your Database (Sydney) - Wednesday, November 13

https://www.meetup.com/apache-kafka-sydney/events/266137580/

K8s Meetup with Instaclustr & Google! (Pyrmont) - Wednesday, November 13

https://www.meetup.com/Sydney-Kubernetes-User-Group/events/265679988/

Sydney Data Engineering Meetup (Sydney) - Thursday, November 14

https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/264743457/


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly #328

Short and sweet issue this week, with several new open source tools—Beekeeper for cleaning up unused data, the Mantis project for real-time operations, and pg_flame's flame graphs for analyzing postgres queries—as well as implementation articles covering Apache Airflow, Rust for Kafka, and using bloom filters to optimize GDPR data deletion.


Beekeeper is a new open-source tool from Expedia for cleaning up unreferenced data in Apache Hive tables. It listens for metastore changes (like changes to table or partition locations) and periodically deletes abandoned data in S3. The GitHub repo has instructions for trying it out using Docker.

https://medium.com/expedia-group-tech/introducing-beekeeper-35ab8b770f23

Netflix has open sourced Mantis, its operations-focused event analysis system. Mantis aims to provide cost-effective, event-based analysis of a live system. It does so by allowing applications to publish lots of events but only incur the cost of those events when there is a subscriber to the event stream. At Netflix, they use Mantis for things like monitoring video streaming health, contextual alerting, measuring Cassandra health, and alerting on log events. The Mantis GitHub page has lots more details about the system, including an overview of its architecture and details on how to try it out.

https://medium.com/netflix-techblog/open-sourcing-mantis-a-platform-for-building-cost-effective-realtime-operations-focused-5b8ff387813a

https://netflix.github.io/mantis/



A look at how the Devoted Health team uses Apache Airflow. They deploy on Kubernetes (using Terraform and Helm), which allows developers to get their own instance of the stack for development/testing. They've built a tool for standardized DAG development (using YAML definitions) and a dev tool for synchronizing code changes to Kubernetes. The post also describes how they write integration tests and validations, deploy, and monitor their deployment.

https://adam.boscarino.me/posts/airflow-at-devoted-health/

In another post on Airflow deployments, Lyft writes about how they've implemented fine-grained, secure access to the Apache Airflow UI. While Airflow has built-in RBAC, they built a custom security manager that adds DAG-level access permissions (defined alongside the DAG). At Lyft, each team has its own RBAC role and can decide which teams have access to the DAGs it publishes.

https://eng.lyft.com/securing-apache-airflow-ui-with-dag-level-access-a7bc649a2821

A post on the Confluent blog describes porting a non-trivial Kafka application from Clojure to Rust. The author describes the tradeoffs between the various Rust libraries for Kafka, how to extend a client to support Avro records/Schema Registry, and shares some benchmarks comparing Clojure and Rust performance/memory usage.

https://www.confluent.io/blog/getting-started-with-rust-and-kafka

The Adobe Experience team writes about how they use bloom filters to speed up deletions of user data, for requests like those in GDPR. By maintaining a bloom filter for each partition of a data set, they can prune large swaths of data that definitely don't contain records for a particular user when deleting.

https://medium.com/adobetech/search-optimization-for-large-data-sets-for-gdpr-7c2f52d4ea1f
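
A toy illustration of the pruning idea (not Adobe's implementation), with a from-scratch Bloom filter so no particular library is assumed and with sizes far smaller than anything realistic:

```python
# Keep a small Bloom filter per partition; skip partitions whose filter says
# the user ID definitely isn't there when handling a deletion request.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

partitions = {"2019-10-01": BloomFilter(), "2019-10-02": BloomFilter()}
partitions["2019-10-01"].add("user-42")

# Only partitions that *might* contain the user need to be scanned and rewritten.
to_scan = [p for p, bf in partitions.items() if bf.might_contain("user-42")]
print(to_scan)   # almost certainly just ['2019-10-01']
```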

`pg_flame` looks like a handy tool for visualizing the output of postgres' `EXPLAIN ANALYZE`. Mousing over an entry in the flame graph reveals more details about that step of the `EXPLAIN`.

https://github.com/mgartner/pg_flame

Events

Curated by Datadog ( http://www.datadog.com )

California

Deploying Kafka SSL at Uber Scale + What’s the Time & Why? (Palo Alto) - Tuesday, October 29

https://www.meetup.com/KafkaBayArea/events/265692862/

Arizona

Planning for the Varied Levels of Once-Data-Processing with Kafka (Scottsdale) - Tuesday, October 29

https://www.meetup.com/Kafka-Phoenix/events/265328130/

BRAZIL

Everything You Want to Know about Apache Kafka, but Were Afraid to Ask (Sao Paulo) - Wednesday, October 30

https://www.meetup.com/Sao-Paulo-Apache-Kafka-Meetup-by-Confluent/events/265512308/

UNITED KINGDOM

Building Data Products in Production (London) - Tuesday, October 29

https://www.meetup.com/Data-something/events/265560240/

FRANCE

2 Years in Prod with a Hadoop Cluster (Villeurbanne) - Tuesday, October 29

https://www.meetup.com/lyonjug/events/265614417/

GERMANY

Kafka at Sky Deutschland + Kafka on Kubernetes  (Unterfohring) - Tuesday, October 29

https://www.meetup.com/Apache-Kafka-Germany-Munich/events/265114406/

POLAND

Kafka Connect (Torun) - Wednesday, October 30

https://www.meetup.com/Torun-JUG/events/265934291/

ROMANIA

Cloud Engineering at GoPro (Bucharest) - Tuesday, October 29

https://www.meetup.com/GoPro-Software-Development-in-Bucharest/events/265377428/

RUSSIA

KSQL (Saint Petersburg) - Monday, October 28

https://www.meetup.com/St-Petersburg-Kafka/events/265846058/

INDIA

Stream Processing with Apache Kafka, Flink, and Spark (Bengaluru) - Saturday, November 2

https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/265285812/

AUSTRALIA

ETL, Kafka, and Elasticsearch Combined! (Sydney) - Tuesday, October 29

https://www.meetup.com/Sydney-Alt-Net/events/264970509/

Data Eng Weekly #327

This week's issue covers quite the range of topics—Netflix's change data capture architecture, optimizing cloud costs at Segment, the Apache Arrow Flight protocol, Kubernetes operators/controllers, and Python concurrency. Also, a look at two new projects—MLflow Model Registry and DuckDB, an embedded columnar database engine.


The Apache Arrow blog writes about the new Arrow Flight protocol for sending data fast and efficiently (by sending data in Arrow format). The post goes into the motivation of Flight, describes some of the basics of a Flight server, describes how Flight builds on gRPC, and more. While it's still fairly early in the development process, Flight could prove to be really important for improving the efficiency of large scale data processing.

http://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/

A look at the `pg_prewarm` extension for prewarming the PostgreSQL cache, including how to enable it to run automatically on server startup.

https://www.cybertec-postgresql.com/en/prewarming-postgresql-i-o-caches/
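
For a sense of what a manual invocation looks like, here's a small sketch via psycopg2 (the database and table names are placeholders; the linked post also covers warming caches automatically at server startup):

```python
# Manually prewarm one table's blocks into the PostgreSQL buffer cache.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm;")
    cur.execute("SELECT pg_prewarm('public.page_views');")
    print(cur.fetchone()[0], "blocks prewarmed")
conn.close()
```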

Netflix writes about Delta, their system for shuffling data between systems using change data capture (CDC). They've built delta connectors for MySQL and Postgres that stream data to Apache Kafka. The post discusses their Kafka configurations and the stream processing framework (built on Apache Flink) that processes the CDC data and enriches it to build denormalized records.

https://medium.com/netflix-techblog/delta-a-data-synchronization-and-enrichment-platform-e82c36a79aee

The MLflow Model Registry is a new extension to the MLflow project that provides an API and Web UI for uploading and promoting machine learning models across environments. It has first-class notions of environments/lifecycle stages (e.g. to promote from staging to production), which makes it a good match for CI/CD tooling.

https://databricks.com/blog/2019/10/17/introducing-the-mlflow-model-registry.html
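
A hedged sketch of the workflow the announcement describes (the tracking URI, run ID, and model name are placeholders; check the MLflow docs for the current API):

```python
# Register a run's logged model, then promote that version to a lifecycle stage.
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow.example.com")

# Register the model artifact logged by an earlier run.
result = mlflow.register_model("runs:/0123456789abcdef/model", "churn-classifier")

# Promote that version to Production.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier", version=result.version, stage="Production")
```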

To speed up your Python scripts, you can use multithreading or multiprocessing. This post shows how, if you write your code in a functional way, you can introduce parallelism with only a few changes. It demonstrates ThreadPoolExecutor and ProcessPoolExecutor, and the tradeoffs between the two.

http://pljung.de/posts/easy-concurrency-in-python/
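
A compact example of that pattern, where the per-record work below is just a stand-in:

```python
# Write the work as a pure function, then swap executors depending on whether
# the workload is I/O-bound (threads) or CPU-bound (processes).
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def transform(record):
    # Stand-in for per-record work (an API call, parsing, a computation, ...).
    return record * record

if __name__ == "__main__":
    records = list(range(100))

    # Threads: fine for I/O-bound work; the GIL limits CPU-bound speedups.
    with ThreadPoolExecutor(max_workers=8) as pool:
        threaded = list(pool.map(transform, records))

    # Processes: sidestep the GIL for CPU-bound work, at the cost of pickling.
    with ProcessPoolExecutor(max_workers=4) as pool:
        multiproc = list(pool.map(transform, records))

    assert threaded == multiproc
```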

In Kubernetes, operators and controllers are pretty common for stateful systems or those otherwise dealing with data. Even if you're not building a Kubernetes controller yourself, this post describing the differences between the two is a good introduction.

https://octetz.com/posts/k8s-controllers-vs-operators

Segment describes several optimizations that they made to improve their infrastructure costs. The changes are across all parts of the stack—from data systems to the javascript file that they're serving to customers. On the data front—they describe changes they made to their deployments of Apache Kafka (switching to instances with local storage) and NSQ (moving from a colocated model to a centralized cluster). They also made changes to minimize cross-AZ transfer costs—alterations to Kafka clients and service discovery to keep traffic inside of a single zone.

https://segment.com/blog/the-10m-engineering-problem/

DuckDB is a new embedded, columnar database optimized for analytics workloads. This post shows how to use it via Python bindings, and it compares performance with SQLite on a few queries.

https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html
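
A quick, hypothetical taste of the Python bindings (the table and query are made up):

```python
# DuckDB runs in-process, much like SQLite, but with a columnar engine.
import duckdb

con = duckdb.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount DOUBLE)")
con.execute("INSERT INTO events VALUES (1, 9.99), (1, 5.00), (2, 42.00)")

rows = con.execute(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)   # one (user_id, total) tuple per user
```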



Events

Curated by Datadog ( http://www.datadog.com )

New York

Kafka on Kubernetes: Just Because You Can, Doesn't Mean You Should! (New York) - Tuesday, October 22

https://www.meetup.com/NYC-Open-Data/events/263390404/

Massachusetts

Free Apache Kafka Workshop (Boston) - Tuesday, October 22

https://www.meetup.com/aittg-boston/events/264304883/

BRAZIL

10th Data Engineering Meetup (Belo Horizonte) - Wednesday, October 23

https://www.meetup.com/engenharia-de-dados/events/265772897/

UNITED KINGDOM

Kuberoo (London) - Thursday, October 24

https://www.meetup.com/Kubernetes-London/events/265617529/

SWEDEN

Trustly Duchess Meetup: Introduction to Apache Kafka and Reactive Java (Stockholm) - Wednesday, October 23

https://www.meetup.com/Duchess-Sweden/events/265555150/

SPAIN

Design Principles for an Event-Driven Architecture/Streaming with KSQL (Las Rozas de Madrid) - Thursday, October 24

https://www.meetup.com/Madrid-Kafka/events/265321681/

Extending Spark for Qbeast's SQL DataSource (Barcelona) - Thursday, October 24

https://www.meetup.com/Spark-Barcelona/events/265706465/

FRANCE

Data Engineering with Delta Lake, Pulsar, and Spark-Tools (Paris) - Tuesday, October 22

https://www.meetup.com/Paris-Data-Engineers/events/264819837/

GERMANY

Full Day Apache Cassandra & Kafka Workshop (Berlin) - Monday, October 21

https://www.meetup.com/Distributed-Data-Berlin/events/264890586/

FREE NOW Data Journey to Kafka (Hamburg) - Tuesday, October 22

https://www.meetup.com/Hamburg-Kafka/events/265207803/

Cassandra Meets Kafka at ApacheCon! (Berlin) - Wednesday, October 23

https://www.meetup.com/Berlin-Cassandra-Users/events/265707785/

Apache Kylin Meetup @ OLX (Berlin) - Thursday, October 24

https://www.meetup.com/Apache-Kylin-Meetup-Berlin/events/264945114/

POLAND

Rg-Dev #32 (Rzeszow) - Thursday, October 24

https://www.meetup.com/rg-dev/events/262422311/

AUSTRALIA

Sydney Data Engineering Meetup (Surry Hills) - Thursday, October 24

https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/262769526/

Fintech Production with Kafka Streams (Melbourne) - Thursday, October 24

https://www.meetup.com/melbourne-distributed/events/265013568/


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.

Data Eng Weekly #326

Back after another week off—so we've got the best articles from the past two weeks. Several interesting new things to check out this week—Bigslice and Bigmachine from GRAIL, an interesting strategy for turning change data capture events into audit events on the Debezium blog, and the SLOG system that aims to provide low-latency and strict serializability for multi-region systems. Lots more good stuff—posts on data pipelines, a look at new features in PostgreSQL 12, and auto scaling for Apache Airflow.


GameChanger writes about how they've (mostly) automated loading data from their data pipeline into the data warehouse. Some friction came from defining the schema in the data warehouse, so they wrote a new tool to create table definitions from the Avro schemas in the Confluent Schema Registry.

http://tech.gc.com/let-me-automate-that-for-you/
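
Not GameChanger's tool, but a rough sketch of the idea: pull the latest Avro schema for a subject from the Schema Registry's REST API and emit a CREATE TABLE statement. The registry URL, subject, and type mapping below are illustrative and only handle flat records with primitive fields:

```python
# Translate a registered Avro record schema into a simple CREATE TABLE statement.
import json
import requests

AVRO_TO_SQL = {"string": "VARCHAR", "int": "INTEGER", "long": "BIGINT",
               "float": "REAL", "double": "DOUBLE PRECISION", "boolean": "BOOLEAN"}

def ddl_for_subject(registry_url, subject):
    resp = requests.get(f"{registry_url}/subjects/{subject}/versions/latest")
    resp.raise_for_status()
    schema = json.loads(resp.json()["schema"])    # the Avro record schema
    columns = []
    for field in schema["fields"]:
        ftype = field["type"]
        if isinstance(ftype, list):               # e.g. ["null", "string"]
            ftype = next(t for t in ftype if t != "null")
        if not isinstance(ftype, str):            # nested/complex types: fall back
            ftype = "string"
        columns.append(f'  {field["name"]} {AVRO_TO_SQL.get(ftype, "VARCHAR")}')
    return f'CREATE TABLE {schema["name"]} (\n' + ",\n".join(columns) + "\n);"

print(ddl_for_subject("http://schema-registry:8081", "plays-value"))
```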

The kafkacat CLI tool can be used for quick-to-set-up (but not production-ready) replication between Kafka clusters/topics. This post describes how to invoke it and what some of the caveats are.

https://rmoff.net/2019/09/29/copying-data-between-kafka-clusters-with-kafkacat/

The Debezium blog shares the details of implementing a fascinating technique for building an audit log using change data capture data. The general idea is to populate a secondary table, keyed on transaction ID, with the details of the JWT that was used to perform the transaction. Then, Apache Kafka Streams is used to join the data between the CDC streams for those tables. The post dives into how to build out this type of system in full detail, with lots of sample code showing how to build both a JAX-RS interceptor that automatically populates the table based on the JWT and the Kafka Streams application.

https://debezium.io/blog/2019/10/01/audit-logs-with-change-data-capture-and-stream-processing/

PostgreSQL 12 was released a little over a week ago. The announcement describes some of the features (lots of performance improvements), and a second post on pgdash.io describes a new feature, generated columns. There are some interesting use cases for these, such as normalizing text data for searches.

https://www.postgresql.org/about/news/1976/

https://pgdash.io/blog/postgres-12-generated-columns.html
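
A small sketch of the generated-columns feature applied to the text-normalization use case the pgdash post mentions (the table and connection details are placeholders):

```python
# PostgreSQL 12 stored generated columns: keep a normalized copy of "name"
# that is always in sync, useful for case-insensitive lookups.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE people (
            id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            name       text NOT NULL,
            -- stored generated column: always derived from "name"
            name_lower text GENERATED ALWAYS AS (lower(name)) STORED
        );
    """)
    cur.execute("INSERT INTO people (name) VALUES (%s)", ("Ada Lovelace",))
    cur.execute("SELECT name_lower FROM people WHERE name_lower = %s",
                ("ada lovelace",))
    print(cur.fetchone())   # ('ada lovelace',)
conn.close()
```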

GRAIL has open sourced Bigslice and Bigmachine, which enable distributed computation across large datasets using simple Golang programs. Unlike other big data tools, Bigslice spins up EC2 instances at runtime to distribute your computation. It exposes a high-level programming model (e.g. Map, Join, Filter) for batch processing. The introductory blog post and the GitHub project have many more details, including how to get started (it looks quite easy!).

https://medium.com/grail-eng/bigslice-a-cluster-computing-system-for-go-7e03acd2419b

I can't tell you how many times I've seen a syntax error because I tried to reference a table/column in the wrong part of a SQL query. This post explains when it's OK to cross-reference columns/tables defined in other parts of a SQL query. There's a good cheat sheet that you might find useful as a reference.

https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/

For those interested in distributed systems at global scale—this post dives into SLOG, a new system designed to offer low latency and strict serializability by taking advantage of locality in client access patterns. The post gives a good introduction to the high-level intuition and system design, and if you want more, the full VLDB paper is linked.

http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html

Facebook's Scribe is a high-throughput (2.5TB per second at peak) system for capturing log data. This post shares the high-level design of the system—covering topics like availability (e.g. buffering data to local disk in case of network issues), scalability, and multitenancy.

https://engineering.fb.com/data-infrastructure/scribe/

LinkedIn has open sourced the version of Apache Kafka that they run in production across thousands of brokers—they base it on Apache Kafka release branches and add their own changes. The post talks about some of the improvements they've made, like better scalability by reusing UpdateMetadataRequest objects and a maintenance mode that makes it easier to cleanly take down a broker. The post also describes their development process and how they integrate with the upstream Apache Kafka project.

https://engineering.linkedin.com/blog/2019/apache-kafka-trillion-messages

A look at how to ensure you're getting the best performance out of Postgres (things like partial indexes and increasing the shared buffer cache) as well as some advanced features you might not have known about, like text search, geospatial indexes, hstore for key/value data, and JSON/XML data types.

https://dev.to/heroku/postgres-is-underrated-it-handles-more-than-you-think-4ff3

A look at using the Kubernetes HorizontalPodAutoscaler to autoscale the workers of an Apache Airflow deployment. While the post has some details that are specific to Google Cloud Composer (a managed service for Apache Airflow), if you're interested in autoscaling your Airflow workers, this looks like a good place to get started.

https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60

Convoy writes about how they improved the latency of data loads into their data warehouse by using Kafka Connect to load JSON data from Postgres (via Debezium) into Snowflake. The post has lots of practical details on deploying a production pipeline of this style.

https://medium.com/convoy-tech/logs-offsets-near-real-time-elt-with-apache-kafka-snowflake-473da1e4d776


Events

Curated by Datadog ( http://www.datadog.com )

Colorado

Hadoop Rising: The Evolving Ecosystem (Boulder) - Thursday, October 17 

https://www.meetup.com/Boulder-Denver-Big-Data/events/265388338/

Illinois

Full-Day Apache Cassandra and Kafka Workshop (Chicago) - Thursday, October 17 

https://www.meetup.com/Chicago-SQL/events/263999278/

Ohio

Kafka & KSQL (Columbus) - Tuesday, October 15 

https://www.meetup.com/MODUG-Mid-Ohio-Data-User-Group/events/265149919/

UNITED KINGDOM

Apache Beam Meetup 8: Streaming SQL in Beam + Beam Use Case by Huq Industries (London) - Wednesday, October 16 

https://www.meetup.com/London-Apache-Beam-Meetup/events/263701679/

FRANCE

Apache Beam Meetup 2: Portability, Beam on Spark, and More! (Paris) - Thursday, October 17 

https://www.meetup.com/Paris-Apache-Beam-Meetup/events/264545288/

GERMANY

Berlin AWS Group Meetup (Berlin) - Tuesday, October 15 

https://www.meetup.com/aws-berlin/events/258598000/

Apache Kafka at Deutsche Bahn & Confluent Cloud (Frankfurt) - Wednesday, October 16 

https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/264399318/

AUSTRIA

Managing Data Flows: Apache NiFi Deep Dive + Streaming Use Cases (Vienna) - Thursday, October 17

https://www.meetup.com/futureofdata-vienna/events/264352692/

POLAND

First Warsaw Airflow Meetup (Warsaw) - Thursday, October 17 

https://www.meetup.com/Warsaw-Airflow-Meetup/events/264867971/

KENYA

MQTT and Apache Kafka: A Case Study of Uchumi Commercial Bank-Tanzania (Nairobi) - Saturday, October 19 

https://www.meetup.com/nairobi-jvm/events/265236973/

INDIA

Open Source Technologies at Expedia (Bangalore) - Wednesday, October 16

https://www.meetup.com/Data-Surfers/events/264806373/

SINGAPORE

Apache Kafka and Microservices (Singapore) - Thursday, October 17 

https://www.meetup.com/Singapore-Kafka-Meetup/events/265194260/

AUSTRALIA

Viktor Gamov and George Hall Talk Kafka, Kubernetes, Connectors, and Operator (Docklands) - Tuesday, October 15 

https://www.meetup.com/KafkaMelbourne/events/264970023/

Kafka on Kubernetes: Does It Really Have to Be “The Hard Way”? (Sydney) - Thursday, October 17 

https://www.meetup.com/apache-kafka-sydney/events/265104559/

FinTech Production with Kafka Streams (Melbourne) - Thursday, October 17 

https://www.meetup.com/melbourne-distributed/events/263797130/


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.
