An architecture for federated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas

An architecture for federated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas

Tuesday, June 19
2:50 PM - 3:30 PM
Grand Ballroom 220B

Comcast's Streaming Data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. We have previously reported (DataWorks Summit 2017) on how we extended Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom lambda functions notify Atlas of creation of new entities and new lineage links via asynchronous kafka messaging.

Recently we were presented the challenge of providing integrated data discovery and lineage across our public cloud datasources and on-prem datasources, both Hadoop-based and traditional data warehouses and RDBMSs. Can Apache Atlas meet this challenge? A resounding yes! This talk will present our federated architecture, with Atlas providing SQL-like, free-text, and graph search across select metadata from all on-prem and public cloud data sources in our purview. Lightweight, custom connectors/bridges identify metadata/lineage changes in underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to underlying sources may be used for special-purpose metadata mining.

While metadata repositories for data discovery and lineage abound, none of them have built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs. In-house-built solutions typically underestimate the cost of development and maintenance and often suffer from architecture-by-accretion. Atlas' commitment to extensibility, built-in provision of typed, free-text, and graph search, and REST and asynchronous APIs, position it uniquely in the build-vs-buy sweet spot.

Presentation Video


Barbara Eckman
Principal Software Architect
Barbara Eckman is a Principal Software Architect at Comcast. She leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing Big Data. Barbara is a recognized technical innovator in Big Data architecture and governance, as well as scientific data and model integration. Her experience includes technical leadership positions at a Human Genome Project Center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.