NameNode Analytics – Scouting the HDFS Metadata

Wednesday, June 20
4:40 PM - 5:30 PM
Meeting Room 211A/B/C/D

Today, the state of the art for observing HDFS metadata changes lends itself to only a couple of architectures. These typically use the legacy FsImage, the OfflineImageViewer to parse it to plaintext, and ElasticSearch, Kibana, or perhaps HDFSDU for viewing. Some operators even run scripts that perform status listings or counts on various target directories against the active NameNode.
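
To make the legacy pipeline described above concrete, here is a minimal sketch of the reporting step: it reads tab-separated rows such as those produced by the OfflineImageViewer's Delimited processor and sums file counts and bytes per user. The column order, the sample rows, and the function name usage_by_user are assumptions for illustration, not taken from the talk; check the output of your own Hadoop version before relying on the layout.

```python
import csv
from collections import defaultdict

# Assumed column order of the delimited FsImage dump (tab-separated).
COLUMNS = ["Path", "Replication", "ModificationTime", "AccessTime",
           "PreferredBlockSize", "BlocksCount", "FileSize",
           "NSQUOTA", "DSQUOTA", "Permission", "UserName", "GroupName"]


def usage_by_user(lines):
    """Sum file count and logical bytes per user from delimited image rows."""
    totals = defaultdict(lambda: {"files": 0, "bytes": 0})
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip the header row and anything that does not match the layout.
        if len(row) != len(COLUMNS) or row[0] == "Path":
            continue
        record = dict(zip(COLUMNS, row))
        # Directories carry a leading 'd' in the permission string.
        if record["Permission"].startswith("d"):
            continue
        entry = totals[record["UserName"]]
        entry["files"] += 1
        entry["bytes"] += int(record["FileSize"])
    return dict(totals)


def _row(*fields):
    return "\t".join(str(f) for f in fields)


# Hypothetical sample rows standing in for a real (much larger) image dump.
SAMPLE_ROWS = [
    _row(*COLUMNS),
    _row("/data", 0, "2018-06-01 10:00", "1970-01-01 00:00",
         0, 0, 0, -1, -1, "drwxr-xr-x", "hdfs", "supergroup"),
    _row("/data/a.log", 3, "2018-06-01 10:05", "2018-06-01 10:05",
         134217728, 1, 1024, 0, 0, "-rw-r--r--", "alice", "analysts"),
    _row("/data/b.log", 3, "2018-06-01 10:06", "2018-06-01 10:06",
         134217728, 1, 2048, 0, 0, "-rw-r--r--", "alice", "analysts"),
    _row("/data/c.log", 3, "2018-06-01 10:07", "2018-06-01 10:07",
         134217728, 1, 512, 0, 0, "-rw-r--r--", "bob", "analysts"),
]

if __name__ == "__main__":
    for user, stats in sorted(usage_by_user(SAMPLE_ROWS).items()):
        print(user, stats["files"], stats["bytes"])
```

On a real cluster the input would be the full image dump, so every report means re-reading hundreds of millions of rows that are already stale by the time the dump finishes.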

These approaches tend to have long processing times and can only show a snapshot view. Even worse, they put additional read load on the Standby and Active NameNodes.
For our largest clusters at PayPal, it can take hours to produce a NameNode image, parse it, and generate reports. Many times the damage was already done by the time we got those reports. Thus, we strove to find a way to graph HDFS usage by user, by directory, or in nearly any way we wanted, but much closer to real time.

So, we decided to create a new tool and a new NameNode. It is a Standby NameNode with no RPC server (so no clients or DataNodes can connect to it) and a custom query engine on top of it, accessible through a REST API. It stays up to date by fetching batches of edits from the JournalNodes, just like the real Standby. With this new NameNode, which we call NNA internally, we are able to generate reports much more quickly, about once every 30 minutes, that give great insight into directory usage and growth, per-user usage growth, quota usage, and more. It also gives us the ability to define very precise searches across the entire NameNode.
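
The kind of filter-and-aggregate search described above can be illustrated with a small sketch over an in-memory list of file records. This is not NNA's query engine or its REST API; the INode record, the query function, and the sample namespace are all hypothetical and only show the general idea of applying a predicate and then grouping a sum.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class INode:
    """Hypothetical, simplified stand-in for a file entry in the namespace."""
    path: str
    owner: str
    size: int   # bytes
    mtime: int  # modification time, epoch seconds


def query(inodes, predicate, group_key):
    """Sum the sizes of inodes that match predicate, grouped by group_key."""
    totals = defaultdict(int)
    for inode in inodes:
        if predicate(inode):
            totals[group_key(inode)] += inode.size
    return dict(totals)


def top_level_dir(inode):
    """Return the first path component, e.g. '/user' for '/user/alice/a'."""
    parts = inode.path.split("/")
    return "/" + parts[1] if len(parts) > 1 else "/"


# Hypothetical namespace contents.
NAMESPACE = [
    INode("/user/alice/a.parquet", "alice", 4000, 1_529_000_000),
    INode("/user/alice/b.parquet", "alice", 500, 1_529_400_000),
    INode("/user/bob/c.csv", "bob", 2500, 1_529_450_000),
    INode("/tmp/scratch.bin", "bob", 9000, 1_520_000_000),
]

if __name__ == "__main__":
    cutoff = 1_529_300_000
    # Bytes written since the cutoff, grouped by owner.
    print(query(NAMESPACE, lambda i: i.mtime >= cutoff, lambda i: i.owner))
    # Bytes in files larger than 1000 bytes, grouped by top-level directory.
    print(query(NAMESPACE, lambda i: i.size > 1000, top_level_dir))
```

Because the namespace is already held in memory and kept current from the edit stream, a search like this avoids the dump-and-parse step entirely, which is what makes reports on the order of every 30 minutes plausible.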

While this is still very much an incomplete project that needs several improvements, it has already helped us immensely within PayPal to graph in real time how our HDFS users are behaving, and to see how HDFS activity looks within a window as small as a single day.


Plamen Jeliazkov
Senior Hadoop Engineer
Plamen J. Jeliazkov has been an HDFS contributor for about 6 years and is considered an expert by his peers. He specializes in HDFS, most notably NameNode internals. He was part of the team that brought truncate functionality to Hadoop. He is currently a senior Hadoop engineer at PayPal. His excitement comes from shining and polishing HDFS clusters to work at their best. From his Twitter: "Programmer. Gamer. Nerd. UCSD alumni. I develop Hadoop and HBase. I like computer systems, video games, and crypto."
Russ McElroy
Sr. Manager, Big Data Platform
PayPal Inc.
Russ McElroy is the Big Data Platform Manager at PayPal, responsible for the availability, architecture, strategy, and vision of PayPal's Big Data assets, including Hadoop, Druid, and others. He has 20+ years in e-commerce and payments DevOps with a special interest in distributed systems. In previous roles, he took on eBay's scaling challenges in transaction databases and storage, search, analytics, and cloud environments. His current focus is on enhancing PayPal's data governance capabilities by gaining greater insight into analytics metadata (analytics on analytics) while leveraging and providing open source options. He has a B.S. in Computer Science and Engineering from UC Davis.