HP operates a complex HDP environment with key stakeholders and critical data across a variety of business areas: finance, supply chain, sales, and customer support. We load over 8,000 files per day and execute 1.5M lines of SQL via 6,000 jobs running against 637B rows of data spread across more than 5,000 tables in 77 domains. At this scale, sizing the cluster correctly and monitoring job performance are essential to our success and to the satisfaction of our stakeholders across the different business and IT organizations.
In this talk, we will describe the different sizing and allocation approaches that we went through. Our first method was a bottom-up, storage-based calculation that took into account the legacy data, replication factors, overhead, and user space requirements. We quickly realized that the existing compute capacity would not meet the needs of the follow-up phases of the project and that the bottom-up approach relied on too many assumptions and had too many limitations.
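As a rough illustration of what a bottom-up, storage-based estimate can look like, here is a minimal Python sketch. It is not the calculation used in the project: the formula, the function name `nodes_for_storage`, and every number in the example are assumptions chosen only to show how legacy data, replication, overhead, and user space might combine into a node count.

```python
import math


def nodes_for_storage(legacy_tb, user_space_tb, replication_factor,
                      overhead_fraction, usable_tb_per_node):
    """Estimate the node count needed to hold the data (hypothetical formula).

    Assumes legacy data and user space are both replicated, and that
    overhead (temp space, intermediate output, headroom) is a flat
    fraction on top of the replicated volume.
    """
    logical_tb = legacy_tb + user_space_tb
    replicated_tb = logical_tb * replication_factor
    required_tb = replicated_tb * (1 + overhead_fraction)
    return math.ceil(required_tb / usable_tb_per_node)


# Illustrative inputs only; none of these figures come from the talk.
estimate = nodes_for_storage(
    legacy_tb=500,
    user_space_tb=100,
    replication_factor=3,
    overhead_fraction=0.25,
    usable_tb_per_node=48,
)
print(f"Estimated nodes for storage: {estimate}")
```

A calculation of this kind says nothing about whether the resulting nodes supply enough compute to finish the workload on time, which is the limitation described above.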
The second method was to work top-down to determine how many jobs could run within a set number of hours. This required us to calculate the number of slots available for map and reduce tasks within a set amount of YARN memory. To support this analysis, we developed advanced dashboards and reports that we will also share during the presentation. We captured statistics for every job and calculated the average map and reduce times. With this information, we could then calculate the compute and storage needed to meet the required SLAs. As a result, the cluster grew by 88 nodes and now operates with 21 TB of YARN memory.
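The top-down reasoning can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the method presented in the talk: it assumes one fixed container size for map and reduce tasks, perfectly packed slots, and no job dependencies or skew. The function names and all input figures are hypothetical.

```python
import math


def slots_needed(map_tasks, avg_map_sec, reduce_tasks, avg_reduce_sec,
                 window_hours):
    """Concurrent task slots needed to finish all tasks inside the window.

    Assumes tasks pack perfectly into slots, which real schedulers do not
    achieve; treat the result as a lower bound.
    """
    total_task_sec = map_tasks * avg_map_sec + reduce_tasks * avg_reduce_sec
    window_sec = window_hours * 3600
    return math.ceil(total_task_sec / window_sec)


def yarn_memory_needed_gb(slots, container_gb):
    """YARN memory required if every slot is one container of fixed size."""
    return slots * container_gb


# Illustrative inputs only; none of these figures come from the talk.
slots = slots_needed(
    map_tasks=400_000, avg_map_sec=45,
    reduce_tasks=50_000, avg_reduce_sec=120,
    window_hours=6,
)
memory_gb = yarn_memory_needed_gb(slots, container_gb=4)
print(f"Slots needed: {slots}, YARN memory needed: {memory_gb} GB")
```

Dividing the resulting memory figure by the YARN memory available per node would then give a compute-driven node count to compare against the storage-driven one.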