An architecture for federated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas

Tuesday, June 19
2:50 PM - 3:30 PM
Grand Ballroom 220B

Comcast's Streaming Data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. We have previously reported (DataWorks Summit 2017) on how we extended Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom Lambda functions notify Atlas of the creation of new entities and new lineage links via asynchronous Kafka messaging.
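As a rough illustration of the notification path described above, the following Python sketch builds the kind of JSON message a function might publish to Kafka when a new entity is created. It is not the implementation from the talk. The type name `example_s3_object`, the attribute values, the user name, and the helper `build_entity_create_message` are all hypothetical. The `ATLAS_HOOK` topic name and the `ENTITY_CREATE_V2` message type reflect Atlas's hook notification conventions, but the exact envelope should be verified against the Atlas version in use. Publishing is deliberately left out so the sketch runs without a Kafka client.

```python
import json

# Atlas's conventional inbound notification topic (verify for your deployment).
ATLAS_HOOK_TOPIC = "ATLAS_HOOK"


def build_entity_create_message(type_name, qualified_name, name, user="example-notifier"):
    """Return a JSON string announcing one new entity to Atlas.

    The envelope layout is an assumption modeled on Atlas hook notifications.
    """
    entity = {
        "typeName": type_name,  # a custom entity type registered in Atlas
        "attributes": {
            "qualifiedName": qualified_name,  # unique key Atlas uses to identify the entity
            "name": name,
        },
    }
    envelope = {
        "version": {"version": "1.0.0"},
        "message": {
            "type": "ENTITY_CREATE_V2",
            "user": user,
            "entities": {"entities": [entity]},
        },
    }
    return json.dumps(envelope)


if __name__ == "__main__":
    payload = build_entity_create_message(
        "example_s3_object",  # hypothetical custom type
        "s3://example-bucket/events/part-0000.avro",
        "part-0000.avro",
    )
    # A real function would hand `payload` to a Kafka producer for ATLAS_HOOK_TOPIC.
    print(ATLAS_HOOK_TOPIC, payload)
```

Because the message is sent asynchronously, the publishing function does not wait for Atlas to process it; Atlas consumes the topic on its own schedule.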

Recently we were presented with the challenge of providing integrated data discovery and lineage across our public cloud datasources and on-prem datasources, both Hadoop-based and traditional data warehouses and RDBMSs. Can Apache Atlas meet this challenge? A resounding yes! This talk will present our federated architecture, with Atlas providing SQL-like, free-text, and graph search across select metadata from all on-prem and public cloud data sources in our purview. Lightweight, custom connectors/bridges identify metadata/lineage changes in underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to underlying sources may be used for special-purpose metadata mining.
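To make the three search styles concrete, here is a minimal Python sketch of how a portal layer might construct Atlas REST v2 query URLs for SQL-like (DSL) search, free-text search, and lineage (graph) traversal. The base URL `http://atlas.example.com:21000`, the helper names, and the sample query and GUID are hypothetical, and the sketch only builds URLs; authentication and the HTTP call itself are omitted. The endpoint paths are those of the Atlas v2 REST API and should be checked against the deployed version.

```python
from urllib.parse import quote, urlencode

API_ROOT = "/api/atlas/v2"


def dsl_search_url(base, dsl_query, limit=10):
    """SQL-like search, e.g. "hive_table where name = 'orders'"."""
    return f"{base}{API_ROOT}/search/dsl?" + urlencode({"query": dsl_query, "limit": limit})


def fulltext_search_url(base, text):
    """Free-text search across indexed metadata."""
    return f"{base}{API_ROOT}/search/fulltext?" + urlencode({"query": text})


def lineage_url(base, guid, direction="BOTH", depth=3):
    """Graph traversal: lineage upstream and/or downstream of one entity."""
    params = urlencode({"direction": direction, "depth": depth})
    return f"{base}{API_ROOT}/lineage/{quote(guid, safe='')}?{params}"


if __name__ == "__main__":
    base = "http://atlas.example.com:21000"  # hypothetical host
    print(dsl_search_url(base, "hive_table where name = 'orders'"))
    print(fulltext_search_url(base, "clickstream"))
    print(lineage_url(base, "abc-123"))  # hypothetical entity GUID
```

A portal could issue these requests first and then, once an entity of interest is found, link out to the source-specific interface for deeper metadata mining.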

While metadata repositories for data discovery and lineage abound, none of them has built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs. Teams that build in-house solutions typically underestimate the cost of development and maintenance, and the results often suffer from architecture-by-accretion. Atlas' commitment to extensibility, its built-in provision of typed, free-text, and graph search, and its REST and asynchronous APIs position it uniquely in the build-vs-buy sweet spot.


Barbara Eckman
Senior Principal Software Architect
Barbara Eckman is a Senior Principal Software Architect in Customer Experience Technologies at Comcast. She is Lead Architect for data discovery and lineage for a division-wide platform comprising streaming, transforming, storing, governing, and analyzing Big Data. Barbara is also the Lead Metadata Architect for the Comcast Privacy Program, an initiative tackling the challenge of legislation like the California Consumer Privacy Act. Her prior experience includes scientific data and model integration at the Human Genome Project, Merck, GlaxoSmithKline, and IBM, where she served on the peer-elected IBM Academy of Technology.