Exploiting machine learning to keep Hadoop clusters healthy

Wednesday, June 20
4:40 PM - 5:30 PM
Executive Ballroom 210B/F

Oath has one of the largest Hadoop footprints in the industry, with tens of thousands of jobs run every day, so reliability and consistency are key. Across 50k+ nodes, a considerable number of hosts will have disk, memory, network, or general slowness issues at any given time. Hosts that struggle to serve or run jobs can dramatically increase the run times of jobs with tight SLAs, frustrating both the users and the support team that has to debug them.

We are constantly working to develop systems that work in tandem with Hadoop to quickly identify and single out pressure points. Here we concentrate on disks: in our experience, disks are the most fragile and trouble-prone components, especially high-density disks. Because of the huge scale and the monetary impact of slow-performing disks, we took on the challenge of building a system to predict worn-out disks and take them out of service before they become performance bottlenecks and hit jobs' SLAs. The task sounds simple: look at the symptoms of hard-drive failure and pull the failing drives, right? It is not so straightforward when we are talking about 200k+ disk drives. Just collecting that much data periodically and reliably is a small challenge compared to analyzing such huge datasets and predicting bad disks. For each disk we collect statistics such as reallocated sector count, reported uncorrectable errors, command timeouts, and uncorrectable sector count. On top of that, each hard-disk model has its own interpretation of these statistics.
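
As a minimal illustration (not Oath's actual collector), the attributes named above map to standard SMART attribute rows as reported by `smartctl -A`. The sketch below parses those four attributes from a sample of that output; in a real collector the text would come from running `smartctl` on each host, and, as noted above, raw-value semantics vary by drive model.

```python
# Minimal sketch of extracting the four SMART attributes mentioned above
# from `smartctl -A` output. SAMPLE stands in for the real command output
# so the snippet is self-contained; raw-value meaning varies per model.

WATCHED = {
    "Reallocated_Sector_Ct",   # reallocated sectors count
    "Reported_Uncorrect",      # reported uncorrectable errors
    "Command_Timeout",         # command timeouts
    "Offline_Uncorrectable",   # uncorrectable sector count
}

def parse_smart(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have 10+ columns;
        # the raw value is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            stats[fields[1]] = int(fields[9])
    return stats

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       5
"""

print(parse_smart(SAMPLE))
```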

How can one predict whether a disk is going to fail based on the symptoms and system stats collected? We chose machine learning to accomplish this. The system can be divided broadly into four subsystems:
1. Dataset collection: This involves gathering disk performance stats, counters, and logs for thousands of disk drives across thousands of servers and storing them at a centralized location. We use the Elastic Stack to collect the data centrally.
2. Cleaning and preprocessing the data: Cleaning extracts the necessary fields from the collected data, and preprocessing converts the extracted data into the format required for further processing.
3. Clustering and labeling the data (using domain knowledge): Once the data has been preprocessed, we need to find datasets that are similar to each other by clustering them. Why clustering? The raw collected data is an unstructured dump; to derive meaning from it, we group similar data points together and keep dissimilar points apart. Self-organizing maps are the clustering technique we use. Once the data has been grouped, each data point is labeled as a "good disk," "going to be bad," or "bad disk." Labeling requires domain knowledge.
4. Training on the clustered data with artificial neural networks: An artificial neural network is a supervised machine learning classifier, and it is the main algorithm we use for prediction. We first pass the labeled data in to train the neural network model. Once the model is trained, we use it to predict whether a disk is good, going to fail, or bad by passing it new, unlabeled data. The accuracy of this model depends on several factors, including the accuracy of the labeled data and the learning/training parameters.
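
The pipeline above can be sketched end to end on synthetic data. This is a toy illustration under stated assumptions, not the production system: the per-disk feature vectors are made up, the self-organizing map is a tiny hand-rolled version, scikit-learn's `MLPClassifier` stands in for the artificial neural network, and labels come from how the data was generated rather than from domain experts.

```python
# Toy sketch of steps 3 and 4: cluster per-disk stats with a small
# self-organizing map, then train a neural-network classifier on labeled
# data. Synthetic data and library choices are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def train_som(data, grid=(4, 4), epochs=50, lr=0.5):
    """Train a very small self-organizing map; return the codebook."""
    n_units, dim = grid[0] * grid[1], data.shape[1]
    weights = rng.random((n_units, dim))
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    sigma0 = max(grid) / 2.0
    for t in range(epochs):
        frac = t / epochs
        sigma = sigma0 * (1 - frac) + 0.5   # shrinking neighborhood
        alpha = lr * (1 - frac)             # decaying learning rate
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))[:, None]
            weights += alpha * h * (x - weights)
    return weights

def som_cluster(data, weights):
    """Assign each sample to its best-matching SOM unit."""
    return np.array([np.argmin(((weights - x) ** 2).sum(axis=1)) for x in data])

# Synthetic stand-in for per-disk vectors: [reallocated sectors,
# reported uncorrectable errors, command timeouts, uncorrectable sectors].
healthy = rng.normal([0, 0, 0, 0], 0.5, size=(60, 4))
failing = rng.normal([50, 10, 5, 30], 3.0, size=(60, 4))
X = np.vstack([healthy, failing])

som = train_som(X)
units = som_cluster(X, som)  # step 3: group similar disks together

# In practice domain experts label the clusters; here the labels follow
# from how the synthetic data was constructed.
y = np.array(["good"] * 60 + ["bad"] * 60)

# Step 4: supervised classifier trained on the labeled data.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
clf.fit(X, y)

# Predict on a new, unlabeled disk vector.
new_disk = np.array([[55.0, 12.0, 4.0, 28.0]])
print(clf.predict(new_disk))
```

The SOM step groups disks with similar symptom profiles so that labeling can happen per cluster instead of per disk, which is what makes expert labeling tractable at 200k+ drives.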

Presentation Video


Dheeraj Kapur
Principal Engineer
Principal Engineer with 12+ years of IT experience, including cloud computing, advanced systems automation and tools design, and system administration. Skilled in managing infrastructure and implementing technology to support large user groups at corporate headquarters and multiple remote locations, and in effectively managing high-end Hadoop clusters. Builds, manages, and supports Hadoop clusters with thousands of nodes and petabytes of data, running the Hadoop Distributed File System (HDFS) and the MapReduce framework. Set up an automation framework for managing user access and data on clusters, automated the management of Yahoo clusters and the break-fixing of bad nodes, and delivered a completely automated solution for 40+ clusters / 45k nodes spanning three colos.
Swetha Banagiri
Associate Production Engineer
Working as an Associate Production Engineer at Oath with one-plus years of experience. With Hadoop/Storm/Spark infrastructure in place on the Grid, the team manages 40+ clusters spanning three colos with 50k+ nodes (data nodes and service nodes). Proficient in running autodeploys, debugging user/job issues at the OS level, setting up centralized logging for key components, and setting up monitoring/alerting services across clusters.