Marmaray: Uber’s Open-sourced Generic Hadoop Data Ingestion and Dispersal Framework

Wednesday, May 22
11:00 AM - 11:40 AM
Marquis Salon 9

Marmaray is Uber’s general-purpose Apache Hadoop data ingestion and dispersal framework and library, open-sourced in 2018. Designed and built by our Hadoop Platform team, Marmaray is a plug-in-based framework that runs on top of the Hadoop ecosystem. Users can add support to ingest data from any source and disperse it to any sink using Apache Spark. The name Marmaray comes from a tunnel in Turkey connecting Europe and Asia. Similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink, depending on customer preference.

Many data users (e.g., Uber Eats and Uber’s machine learning platform, Michelangelo) use Hadoop in concert with other tools to build and train their machine learning models, ultimately producing derived datasets of immense additional value that drive Uber’s business toward greater efficiency and profitability. To maximize the usefulness of these derived datasets, the need arose to disperse this data to online datastores that serve live traffic, often with much lower latency semantics than what existed in the Hadoop ecosystem.

Before we introduced Marmaray, each team was building its own ad-hoc dispersal system. This duplication of effort and the creation of esoteric, non-universal features generally led to an inefficient use of engineering resources. Marmaray was envisioned, designed, and ultimately released in late 2017 to fulfill the need for a flexible, universal dispersal platform that would complete the Hadoop ecosystem by providing the means to transfer Hadoop data out to any online datastore.

Along the same lines, Uber’s business needs necessitated the ingestion of raw data from a variety of data sources into its Hadoop data lake. Our previous data architecture required running and maintaining multiple data pipelines in production, each corresponding to a different production codebase, which proved cumbersome over time as the size of the data increased proportionally with Uber’s business growth. Data sources such as MySQL, Kafka, and Schemaless contained raw data that needed to be ingested into Hive to support diverse analytical needs from teams across the company. Each data source required understanding a different codebase and its associated intricacies, as well as a unique set of configurations, graphs, and alerts. Adding new ingestion sources became non-trivial, and the overhead for maintenance required that our Big Data ecosystem support all of these systems. The on-call burden could be suffocating, with sometimes more than 200 alerts per week.

You’ll learn how the Marmaray team designed and built a common set of abstractions to handle both the ingestion and dispersal use cases, the challenges and lessons learned from both developing the core library and setting up an on-demand self-service workflow, and how the team leveraged Apache Spark to ensure the platform can scale to handle Uber’s growing data needs.

We'll also describe the business impact that Marmaray has had in use cases related to some of our fastest-growing business verticals, like Uber Eats and Uber Freight. Finally, we will explain how this framework helped Uber meet its GDPR requirements.

For more information, please see our Uber Engineering blog post here:

Presentation Video


Danny Chen
Engineering Manager
I am currently an Engineering Manager at Uber, where I am a member of the Hadoop Platform team working on large-scale data ingestion and dispersal pipelines and libraries leveraging Apache Spark. I was previously the tech lead on the metrics team at Uber Maps, building data pipelines to produce metrics that help analyze the quality of our mapping data. Before joining Uber, I worked at Twitter as an original member of the Core Storage team building Manhattan, a key/value store powering Twitter's use cases. I love learning anything about storage, data platforms, and distributed systems at scale.