Data Eng Weekly #329
Some great stuff in this week’s issue: Lyft’s open source metadata and data discovery platform, how Netflix uses GraphQL to build search indexes, an open source backup tool for Apache Cassandra, coverage of the Ceph distributed file system's evolution, and several other posts about Apache Airflow, Apache Spark, CockroachDB, event-driven microservices, and more!
Lyft writes about Amundsen, their open-source tool for metadata management and data discovery. The post covers the architecture of the system: the metadata, search, and frontend services, as well as the databuilder (a data ingestion framework). Since open sourcing it earlier this year, they've had a number of contributions, such as one to make the datastore more extensible (supporting Apache Atlas in addition to Neo4j).
https://eng.lyft.com/open-sourcing-amundsen-a-data-discovery-and-metadata-platform-2282bb436234
A look at how to use Docker Swarm to scale out Apache Airflow using a custom Airflow operator.
https://medium.com/analytics-vidhya/orchestrating-airflow-tasks-with-docker-swarm-69b5fb2723a7
One data engineer's learnings after a year on the job. There are some good reflections on tools, automation (after a workflow of a certain size, a tool like Airflow is important), monitoring, and metadata/documentation housekeeping.
https://medium.com/@lohmengxin/one-year-as-a-data-engineer-d713b511f0d4
Netflix describes how they use GraphQL to build indexes from data stored across multiple services. The key idea is to issue a (batch) GraphQL query to return a full denormalized record (e.g. a show and its episodes, etc.) and store the results in Elasticsearch. Next, they can listen to changes on Kafka, and follow the relationships from the GraphQL schema to invalidate/reindex data.
https://medium.com/netflix-techblog/graphql-search-indexing-334c92e0d8d5
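To make the invalidation idea above concrete, here is a minimal sketch in plain Python. It is not Netflix's implementation: the `PARENT_OF` mapping, the entity names, and both function names are hypothetical, and the relationships are assumed to be acyclic. The sketch only shows how a change to a nested entity can be traced back to the top-level document that needs reindexing.

```python
# Hypothetical child -> parent edges, of the kind one might derive from a
# GraphQL schema. Real systems would resolve these via the schema and services.
PARENT_OF = {
    ("episode", "e1"): ("season", "s1"),
    ("episode", "e2"): ("season", "s1"),
    ("season", "s1"): ("show", "show-1"),
}


def root_document(node, parent_of):
    """Walk parent links until reaching an entity with no parent.

    Assumes the relationships contain no cycles.
    """
    while node in parent_of:
        node = parent_of[node]
    return node


def documents_to_reindex(changed_nodes, parent_of):
    """Map a batch of changed entities to the distinct root documents to rebuild."""
    return sorted({root_document(node, parent_of) for node in changed_nodes})


# Two change events (as might arrive from a stream) touch the same show.
changes = [("episode", "e2"), ("season", "s1")]
stale = documents_to_reindex(changes, PARENT_OF)
print(stale)
```

Deduplicating by root document means that a burst of changes to one show's episodes triggers a single rebuild of that show's index entry.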
Medusa is a new open source backup tool for Apache Cassandra. It stores backup data in cloud storage (e.g. Amazon S3 or Google Cloud Storage), and it creates smart incremental backups by taking advantage of the immutable nature of SSTables. There are many more details about the tool's features and how to use it in this post on The Last Pickle blog.
https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html
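Why immutability helps with incremental backups can be shown with a small sketch. This is an illustration of the general idea only, not Medusa's code or API: the function name and the file names are made up, and it assumes that a file name uniquely identifies an immutable file, so anything already in storage never needs to be uploaded again.

```python
def plan_incremental_backup(snapshot_files, already_stored):
    """Split a snapshot into files to upload and files to reuse.

    Assumption: files are immutable, so a file already present in backup
    storage under the same name has the same content and can be referenced
    instead of copied again.
    """
    snapshot = set(snapshot_files)
    stored = set(already_stored)
    to_upload = sorted(snapshot - stored)
    reused = sorted(snapshot & stored)
    return to_upload, reused


# Hypothetical file names for illustration.
previous_backup = ["md-1-big-Data.db", "md-2-big-Data.db"]
current_snapshot = ["md-1-big-Data.db", "md-2-big-Data.db", "md-3-big-Data.db"]

to_upload, reused = plan_incremental_backup(current_snapshot, previous_backup)
print("upload:", to_upload)
print("reuse:", reused)
```

A production tool would likely also verify sizes or checksums before trusting a name match; that step is left out here to keep the sketch short.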
An in-depth look at the Apache Kafka consumer rebalance protocol. The post describes the pieces of the protocol like JoinGroup, SyncGroup, Heartbeat, and LeaveGroup. It also looks at the recent additions of static membership and incremental cooperative rebalancing. There are lots of great diagrams to illustrate the key concepts.
A look at how to detect skew in your Apache Spark jobs, and several ways to fix a job with skew (hints, randomizing the join key, writing a custom partitioner). Which solution is best/fastest depends a bit on the inputs to your job.
https://www.davidmcginnis.net/post/spark-job-optimization-dealing-with-data-skew
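For readers unfamiliar with randomizing the join key (often called salting), here is a minimal sketch of the idea in plain Python rather than Spark, so it runs anywhere. All names and data are illustrative and not taken from the linked post. The skewed side gets a random salt appended to each key, and the other side is replicated once per salt value so every salted key still finds its match.

```python
import random
from collections import Counter


def salt_skewed_side(rows, num_salts, rng):
    """Append a random salt to each key on the large, skewed side."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_other_side(rows, num_salts):
    """Replicate each row once per salt value so salted keys still match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def join(left, right):
    """Simple hash join on the (key, salt) pair; returns (key, left, right)."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key[0], lv, rv) for key, lv in left for rv in index.get(key, [])]


NUM_SALTS = 4
rng = random.Random(0)

# One "hot" key dominates the large side of the join.
events = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
users = [("hot", "alice"), ("cold", "bob")]

salted_events = salt_skewed_side(events, NUM_SALTS, rng)
exploded_users = explode_other_side(users, NUM_SALTS)
joined = join(salted_events, exploded_users)

# Rows per salted key: the hot key is now spread over several groups.
partition_sizes = Counter(key for key, _ in salted_events)
print(partition_sizes)
```

The trade-off is that the smaller side grows by a factor of `NUM_SALTS`, which is one reason the best fix depends on the inputs to the job.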
The Morning Paper covers a paper on Ceph, the open-source distributed file system. Over the past few years, Ceph implemented a new store that bypasses the file system to better take advantage of SSD and HDD disks. The post describes the motivation for the changes, some of the other options they explored (including RocksDB), and the performance improvements they see with the new storage backend.
https://blog.acolyer.org/2019/11/06/ceph-evolution/
Cockroach Labs writes about how they've sped up distributed transactions with parallel commits, which avoid certain round trips across the WAN. The post describes the solution, including how failure handling works, and it shows experimentally that latency is cut in half.
https://www.cockroachlabs.com/blog/parallel-commits
This post describes why you should consider a relational database and take advantage of its advanced features (such as triggers and stored procedures). The author argues from experience working with Jupyter notebooks and from comparing the complexity of NoSQL databases like MongoDB or Elasticsearch.
https://tselai.com/modern-data-practice-and-the-sql-tradition.html
This article describes an event-driven architecture for maintaining a CRM and realtime database. An interesting component of the post details how to implement an audit system to ensure that all microservices consume the events. The basic idea is to tag each event with a unique ID and have each microservice generate an audit event with that ID as it processes the event. Alerts are generated when (after aggregating based on a time window) the number of audit events for a particular ID is too low.
https://medium.com/@vladimir.elvov/why-event-driven-is-businesss-best-friend-982749561024
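The audit check described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the article's implementation: the function name, the tuple layout of audit events, and the choice to count distinct services per ID are all hypothetical.

```python
def find_under_audited(published_ids, audit_events, expected_count,
                       window_start, window_end):
    """Return event IDs with too few audit events inside the time window.

    audit_events is a list of (event_id, service_name, timestamp) tuples.
    Counting distinct services (rather than raw audit events) is an
    assumption made here so that retries are not counted twice.
    """
    services_by_id = {event_id: set() for event_id in published_ids}
    for event_id, service, timestamp in audit_events:
        if event_id in services_by_id and window_start <= timestamp < window_end:
            services_by_id[event_id].add(service)
    return sorted(
        event_id
        for event_id, services in services_by_id.items()
        if len(services) < expected_count
    )


# Hypothetical data: two consuming services, a 60-second window.
published = ["evt-1", "evt-2", "evt-3"]
audits = [
    ("evt-1", "billing", 10),
    ("evt-1", "crm", 12),
    ("evt-2", "billing", 11),
    ("evt-2", "crm", 75),  # arrives after the window closes
]

alerts = find_under_audited(published, audits, expected_count=2,
                            window_start=0, window_end=60)
print(alerts)
```

Seeding the lookup with every published ID matters: an event that no service processed at all produces no audit events, so it would otherwise never show up in the aggregation.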
Events
Curated by Datadog ( http://www.datadog.com )
California
Data-Driven Development in Autonomous Driving + Spark Performance Tuning (Mountain View) - Tuesday, November 12
https://www.meetup.com/SF-Big-Analytics/events/265162755/
Data Engineering Meetup (San Diego) - Thursday, November 14
https://www.meetup.com/Data-Engineering-San-Diego/events/266078459/
Texas
Mirror Maker 2.0 (Austin) - Tuesday, November 12
https://www.meetup.com/Austin-Apache-Kafka-Meetup-Stream-Data-Platform/events/266214573/
Virginia
NOVA Data Engineering: First Meetup! (Herndon) - Thursday, November 14
https://www.meetup.com/NOVA-Data-Engineering/events/265820975/
BRAZIL
Data Meetup (Sao Carlos) - Wednesday, November 13
https://www.meetup.com/opensanca/events/265658143/
SWEDEN
Streaming Processing with Hazelcast Jet and Kafka (Stockholm) - Tuesday, November 12
https://www.meetup.com/Knock-Data-Stockholm/events/266041648/
SPAIN
Everything You Need to Know about Kafka Streams (A Coruna) - Thursday, November 14
https://www.meetup.com/CorunaJUG/events/266199913/
FRANCE
Airflow @ SchoolMouv: Build, Schedule, and Monitor Pipelines at Scale (Toulouse) - Wednesday, November 13
https://www.meetup.com/Toulouse-Data-Engineering/events/264518964/
CZECH REPUBLIC
Making Apache Spark Better with Delta Lake (Prague) - Thursday, November 14
https://www.meetup.com/CS-HUG/events/266200104/
POLAND
QA in Beam + Beam Use Case + More! (Warsaw) - Thursday, November 14
https://www.meetup.com/Warsaw-Apache-Beam-Meetup/events/265212437/
GREECE
Timeseries Forecasting as a Service + Run Spark and Flink Jobs on Kubernetes (Athens) - Thursday, November 14
https://www.meetup.com/Athens-Big-Data/events/265957761/
ISRAEL
Airflow Demystified + Big Data Demystified (Tel Aviv-Yafo) - Sunday, November 17
https://www.meetup.com/Big-Data-Demystified/events/263781635/
CHINA
Kafka Beijing Meetup (Beijing) - Saturday, November 16
https://www.meetup.com/Beijing-Kafka-Meetup/events/266223400/
AUSTRALIA
Kafka Is More ACID Than Your Database (Sydney) - Wednesday, November 13
https://www.meetup.com/apache-kafka-sydney/events/266137580/
K8s Meetup with Instaclustr & Google! (Pyrmont) - Wednesday, November 13
https://www.meetup.com/Sydney-Kubernetes-User-Group/events/265679988/
Sydney Data Engineering Meetup (Sydney) - Thursday, November 14
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/264743457/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.