Data Eng Weekly #329

Some great stuff in this week’s issue: Lyft’s open source metadata and data discovery platform, how Netflix uses GraphQL to build search indexes, an open source backup tool for Apache Cassandra, coverage of the Ceph distributed file system’s evolution, and several other posts about Apache Airflow, Apache Spark, CockroachDB, event-driven microservices, and more!


Lyft writes about Amundsen, their open-source tool for metadata management and data discovery. The post covers the architecture of the system: the metadata, search, and frontend services, as well as the databuilder (a data ingestion framework). Since open sourcing it earlier this year, they've had a number of contributions, such as one to make the datastore more extensible (supporting Apache Atlas in addition to Neo4j).

https://eng.lyft.com/open-sourcing-amundsen-a-data-discovery-and-metadata-platform-2282bb436234

A look at how to use Docker Swarm to scale out Apache Airflow using a custom Airflow operator.

https://medium.com/analytics-vidhya/orchestrating-airflow-tasks-with-docker-swarm-69b5fb2723a7

One data engineer's learnings after a year on the job. There are some good reflections on tools, automation (once a workflow reaches a certain size, a tool like Airflow becomes important), monitoring, and metadata/documentation housekeeping.

https://medium.com/@lohmengxin/one-year-as-a-data-engineer-d713b511f0d4

Netflix describes how they use GraphQL to build indexes from data stored across multiple services. The key idea is to issue a (batch) GraphQL query that returns a full denormalized record (e.g. a show and its episodes, etc.) and store the results in Elasticsearch. Next, they can listen to changes on Kafka, and follow the relationships from the GraphQL schema to invalidate/reindex data.

https://medium.com/netflix-techblog/graphql-search-indexing-334c92e0d8d5
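
To make the invalidation step concrete, here is a minimal sketch in plain Python. It is not Netflix's implementation: the entity types, the parent_links mapping, and the function name are all hypothetical, and the relationships are hard-coded here rather than derived from a real GraphQL schema. Given a change to one entity, it walks the relationships upward to find which indexed documents would need to be rebuilt.

```python
from collections import deque

def documents_to_reindex(changed, parent_links, indexed_type):
    """Walk parent relationships from a changed entity and collect the IDs
    of every reachable entity of the indexed type.

    changed      -- (type_name, entity_id) tuple for the entity that changed
    parent_links -- hypothetical mapping: entity -> list of parent entities
    indexed_type -- the type whose denormalized records live in the index
    """
    queue = deque([changed])
    visited = {changed}
    hits = set()
    while queue:
        node = queue.popleft()
        type_name, entity_id = node
        if type_name == indexed_type:
            hits.add(entity_id)
        for parent in parent_links.get(node, ()):
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return sorted(hits)

# Illustrative data only: an episode belongs to a season, which belongs to a show.
links = {
    ("Episode", "e1"): [("Season", "s1")],
    ("Season", "s1"): [("Show", "show1")],
}
affected = documents_to_reindex(("Episode", "e1"), links, "Show")
```

In this sketch, a change event for episode e1 resolves to the show document that embeds it, which is the record that would be re-queried and written back to the search index.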

Medusa is a new open source backup tool for Apache Cassandra. It stores backup data in cloud storage (e.g. Amazon S3 or Google Cloud Storage), and it creates smart incremental backups by taking advantage of the immutable nature of SSTables. There are many more details about the tool's features and how to use it in this post on The Last Pickle blog.

https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html
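
Why immutability helps with incremental backups can be shown with a small sketch. This is not Medusa's code or API: the function name, the file names, and the use of file size as the comparison are assumptions made for illustration. Because an SSTable file never changes after it is written, a file that is already in the backup store can be referenced again instead of uploaded again.

```python
from typing import Dict, List, Tuple

def plan_incremental_backup(
    local_files: Dict[str, int],
    already_stored: Dict[str, int],
) -> Tuple[List[str], List[str]]:
    """Split local SSTable files into those to upload and those to reuse.

    Both arguments map a file name to its size in bytes (hypothetical
    metadata; a real tool might compare checksums instead).
    """
    to_upload = []
    reused = []
    for name, size in sorted(local_files.items()):
        if already_stored.get(name) == size:
            # Immutable file already backed up: reference it, do not copy it.
            reused.append(name)
        else:
            to_upload.append(name)
    return to_upload, reused

# Illustrative file names only.
local = {"md-1-big-Data.db": 100, "md-2-big-Data.db": 50}
stored = {"md-1-big-Data.db": 100}
upload, reuse = plan_incremental_backup(local, stored)
```

Only the new file ends up in the upload list, so the cost of each backup scales with the data written since the previous one.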

An in-depth look at the Apache Kafka consumer rebalance protocol. The post describes the pieces of the protocol like JoinGroup, SyncGroup, Heartbeat, and LeaveGroup. It also looks at the recent additions of static membership and incremental cooperative rebalancing. There are lots of great diagrams to illustrate the key concepts.

https://medium.com/streamthoughts/apache-kafka-rebalance-protocol-or-the-magic-behind-your-streams-applications-e94baf68e4f2

A look at how to detect skew in your Apache Spark jobs, and several ways to fix a job with skew (hints, randomizing the join key, writing a custom partitioner). Which solution is best/fastest depends a bit on the inputs to your job.

https://www.davidmcginnis.net/post/spark-job-optimization-dealing-with-data-skew
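
The "randomizing the join key" idea (often called salting) can be sketched without Spark. The following plain-Python simulation is an illustration only, not code from the linked post: the helper names, the CRC32-based partitioner, and the bucket counts are all assumptions. A random suffix is appended to each key on the large side, and each row on the small side is replicated once per suffix so that every salted key still finds its match.

```python
import random
import zlib
from collections import Counter

def partition_of(key, num_partitions):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_sizes(keys, num_partitions):
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

def salted_key(key, salt_buckets, rng):
    # Large side: append a random salt so one hot key spreads over many keys.
    return f"{key}#{rng.randrange(salt_buckets)}"

def explode_small_side(rows, salt_buckets):
    # Small side: replicate each row once per possible salt value.
    return {f"{key}#{s}": value for key, value in rows for s in range(salt_buckets)}

rng = random.Random(0)
# Skewed input: one key accounts for 90% of the rows.
large_keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

plain = partition_sizes(large_keys, 8)
salted_keys = [salted_key(k, 16, rng) for k in large_keys]
salted = partition_sizes(salted_keys, 8)

small = explode_small_side([("hot", "H")] + [(f"k{i}", i) for i in range(1000)], 16)
joined = [small[k] for k in salted_keys]  # every salted key still joins
```

Comparing max(plain) with max(salted) shows the largest partition shrinking once the hot key is spread out. The trade-off is that the small side grows by a factor equal to the number of salt buckets.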

The Morning Paper covers a paper on Ceph, the open-source distributed file system. Over the past few years, Ceph implemented a new storage backend that bypasses the filesystem to take better advantage of SSDs and HDDs. The post describes the motivation for the changes, some of the other options they explored (including RocksDB), and the performance improvements they see with the new storage backend.

https://blog.acolyer.org/2019/11/06/ceph-evolution/

Cockroach Labs writes about how they've sped up distributed transactions with parallel commits, which avoid certain round trips across the WAN. The post describes the solution, including how failure handling works, and it shows experimentally that latency is cut in half.

https://www.cockroachlabs.com/blog/parallel-commits

This post describes why you should consider a relational database and take advantage of its advanced features (such as triggers and stored procedures). The author motivates the argument with experience working in Jupyter notebooks and a comparison with the complexity of a NoSQL database like MongoDB or Elasticsearch.

https://tselai.com/modern-data-practice-and-the-sql-tradition.html

This article describes an event-driven architecture for maintaining a CRM and realtime database. An interesting component of the post details how to implement an audit system to ensure that all microservices consume the events. The basic idea is to tag events with a unique ID and have each microservice generate an audit event with that ID as it processes the event. Alerts are generated when (after aggregating based on a time window) the number of audit events for a particular ID is too low.

https://medium.com/@vladimir.elvov/why-event-driven-is-businesss-best-friend-982749561024
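
The audit check described above can be sketched in a few lines of Python. This is a simplified illustration and not the article's implementation: the function name, the in-memory inputs, and the choice to count distinct services are assumptions. For each event whose time window has closed, it counts the services that reported an audit event for that ID and flags the IDs with too few.

```python
from collections import defaultdict

def missing_audits(published, audits, expected_count, window_seconds, now):
    """Return {event_id: audit_count} for events that were under-consumed.

    published      -- mapping of event ID to publish time (seconds)
    audits         -- iterable of (event_id, service_name) audit events
    expected_count -- number of services that should process each event
    window_seconds -- how long to wait before judging an event
    now            -- current time (seconds), passed in for testability
    """
    services_seen = defaultdict(set)
    for event_id, service in audits:
        services_seen[event_id].add(service)

    alerts = {}
    for event_id, published_at in published.items():
        if now - published_at < window_seconds:
            continue  # window still open: consumers may yet catch up
        count = len(services_seen.get(event_id, ()))
        if count < expected_count:
            alerts[event_id] = count
    return alerts

# Illustrative data only: two consumers are expected per event.
published = {"e1": 0, "e2": 0, "e3": 95}
audits = [("e1", "billing"), ("e1", "crm"), ("e2", "billing"), ("e3", "billing")]
alerts = missing_audits(published, audits, expected_count=2, window_seconds=60, now=100)
```

Here only e2 is flagged: e1 was audited by both services, and e3 is still inside its window. In a real deployment this aggregation would more likely run in a stream processor over the audit topic than over in-memory lists.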


Events

Curated by Datadog ( http://www.datadog.com )

California

Data-Driven Development in Autonomous Driving + Spark Performance Tuning (Mountain View) - Tuesday, November 12

https://www.meetup.com/SF-Big-Analytics/events/265162755/

Data Engineering Meetup (San Diego) - Thursday, November 14

https://www.meetup.com/Data-Engineering-San-Diego/events/266078459/

Texas

Mirror Maker 2.0 (Austin) - Tuesday, November 12

https://www.meetup.com/Austin-Apache-Kafka-Meetup-Stream-Data-Platform/events/266214573/

Virginia

NOVA Data Engineering: First Meetup! (Herndon) - Thursday, November 14

https://www.meetup.com/NOVA-Data-Engineering/events/265820975/

BRAZIL

Data Meetup (Sao Carlos) - Wednesday, November 13

https://www.meetup.com/opensanca/events/265658143/

SWEDEN

Streaming Processing with Hazelcast Jet and Kafka (Stockholm) - Tuesday, November 12

https://www.meetup.com/Knock-Data-Stockholm/events/266041648/

SPAIN

Everything You Need to Know about Kafka Streams (A Coruna) - Thursday, November 14

https://www.meetup.com/CorunaJUG/events/266199913/

FRANCE

Airflow @ SchoolMouv: Build, Schedule, and Monitor Pipelines at Scale (Toulouse) - Wednesday, November 13

https://www.meetup.com/Toulouse-Data-Engineering/events/264518964/

CZECH REPUBLIC

Making Apache Spark Better with Delta Lake (Prague) - Thursday, November 14

https://www.meetup.com/CS-HUG/events/266200104/

POLAND

QA in Beam + Beam Use Case + More! (Warsaw) - Thursday, November 14

https://www.meetup.com/Warsaw-Apache-Beam-Meetup/events/265212437/

GREECE

Timeseries Forecasting as a Service + Run Spark and Flink Jobs on Kubernetes (Athens) - Thursday, November 14

https://www.meetup.com/Athens-Big-Data/events/265957761/

ISRAEL

Airflow Demystified + Big Data Demystified (Tel Aviv-Yafo) - Sunday, November 17

https://www.meetup.com/Big-Data-Demystified/events/263781635/

CHINA

Kafka Beijing Meetup (Beijing) - Saturday, November 16

https://www.meetup.com/Beijing-Kafka-Meetup/events/266223400/

AUSTRALIA

Kafka Is More ACID Than Your Database (Sydney) - Wednesday, November 13

https://www.meetup.com/apache-kafka-sydney/events/266137580/

K8s Meetup with Instaclustr & Google! (Pyrmont) - Wednesday, November 13

https://www.meetup.com/Sydney-Kubernetes-User-Group/events/265679988/

Sydney Data Engineering Meetup (Sydney) - Thursday, November 14

https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/264743457/


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.