Compute-based sizing and system dashboard

Thursday, June 21
11:30 AM - 12:10 PM
Executive Ballroom 210A/E

HP operates a very complex HDP environment with key stakeholders and critical data across a variety of business areas: finance, supply chain, sales, and customer support. We load over 8,000 files per day and execute 1.5M lines of SQL via 6,000 jobs running against 637B rows of data in over 5,000 tables across 77 domains. Needless to say, defining our cluster size and monitoring job performance are essential for our success and the satisfaction of our stakeholders across the different business and IT organizations.

In this talk, we will describe the different sizing and allocation approaches that we went through. Our first method was a bottom-up storage-based calculation which took into account the legacy data, replication factors, overhead, and user space requirements. We quickly realized the current compute would not meet the needs of the follow-up phases of the project and that the bottom-up approach had too many assumptions and limitations.
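As a rough illustration of what a bottom-up, storage-based calculation can look like, here is a minimal Python sketch. It is not the formula used in the talk: the function name, the way overhead is applied, the assumption that user space is replicated like legacy data, and every input value are hypothetical placeholders.

```python
import math

def storage_based_nodes(legacy_tb, replication_factor, overhead_fraction,
                        user_space_tb, usable_tb_per_node):
    """Estimate a node count from storage requirements alone (bottom-up).

    All parameters are illustrative assumptions, not figures from the talk.
    """
    # Assumption: legacy data and user space are both replicated by HDFS.
    replicated_tb = (legacy_tb + user_space_tb) * replication_factor
    # Assumption: overhead (temp files, logs, intermediate output) is a
    # fixed fraction added on top of the replicated footprint.
    required_tb = replicated_tb * (1 + overhead_fraction)
    # Round up, since a fraction of a node cannot be provisioned.
    return math.ceil(required_tb / usable_tb_per_node)

# Placeholder inputs: 500 TB legacy data, 3x replication, 25% overhead,
# 100 TB user space, 40 TB usable disk per node.
print(storage_based_nodes(500, 3, 0.25, 100, 40))
```

A calculation of this shape says nothing about how long jobs take to run, which is one way to see why a storage-only estimate can leave compute undersized.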

The second method was to work top down to determine how many jobs could run within a set number of hours. This required us to calculate the number of slots for map and reduce tasks within a set amount of YARN memory. To support this analysis, we developed advanced dashboards and reports that we will also share during the presentation. We captured statistics for every job and calculated the average map and reduce times. With this information, we could then calculate the compute and storage needed to meet the required SLAs. As a result, the cluster grew by 88 nodes and now operates with 21 TB of YARN memory.
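The top-down reasoning can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the model presented in the talk: it assumes a single fixed container size for map and reduce tasks, perfect packing of tasks into slots, and per-job statistics supplied as tuples. All names and numbers are hypothetical.

```python
import math

def slots_available(yarn_memory_gb, container_gb):
    """Number of task slots that fit in a given amount of YARN memory."""
    return yarn_memory_gb // container_gb

def slots_required(jobs, window_hours):
    """Slots needed to finish all jobs within the batch window.

    jobs: iterable of (map_tasks, avg_map_sec, reduce_tasks, avg_reduce_sec).
    Assumes tasks pack perfectly into slots with no idle time, so this is
    a lower bound rather than a realistic schedule.
    """
    task_seconds = sum(m * ms + r * rs for m, ms, r, rs in jobs)
    return math.ceil(task_seconds / (window_hours * 3600))

def nodes_required(slots, container_gb, yarn_gb_per_node):
    """Nodes needed to supply enough YARN memory for the given slots."""
    return math.ceil(slots * container_gb / yarn_gb_per_node)

# Placeholder statistics for two jobs and a one-hour window.
jobs = [(100, 60, 10, 120), (200, 30, 20, 90)]
needed = slots_required(jobs, window_hours=1)
print(needed, nodes_required(needed, container_gb=4, yarn_gb_per_node=8))
```

Comparing `slots_required` against `slots_available` for the current cluster gives a quick indication of whether the SLA window is achievable. A real model would also need to account for job dependencies, skew, and queue configuration.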


Janet Li
Big Data IT Manager
HP Inc
Janet Li has over 15 years' experience in the IT industry in the areas of databases, analytics, and big data. Janet has managed internal and external teams of architects, database administrators, big data architects, and infrastructure providers for large IT projects. Most recently, Janet is the lead of the Hortonworks Hadoop Data Lake for HP Inc. Janet has a bachelor’s degree from Wuhan University, China, and a master's degree in Computer Science from the University of St. Thomas, MN, US. Janet is based in Austin, Texas, where she enjoys spending time with her family and hiking with her dog in the Texas Hill Country.
Pranay Vyas
Sr. Consultant
Pranay is an accomplished Hadoop Architect and Engineer with hands-on development experience in Hadoop technologies, including installation, maintenance, and upgrade of HDP clusters and development using Pig, Hive, Spark, Solr, HBase, Flume, and Storm. Pranay has over 12 years of experience with multiple technologies, including server administration, Java technologies, .NET technologies, and mainframe applications.