Pre-Event Training

Courses


HDP Essentials

This 1-day course details the business value for, and provides a technical overview of, Apache Hadoop. It includes high-level information about the concepts, architecture, operation, and uses of the Hortonworks Data Platform (HDP) and the Hadoop ecosystem. The course serves as an optional primer for those who plan to attend a hands-on, instructor-led course.

Audience

Data architects, data integration architects, managers, C-level executives, decision makers, technical infrastructure team, and Hadoop administrators or developers who want to understand the fundamentals of Big Data and the Hadoop ecosystem.

Prerequisites

No previous Hadoop or programming knowledge is required. Students are encouraged to bring a Wi-Fi-enabled laptop pre-loaded with the Hortonworks Sandbox if they want to duplicate the demonstrations on their own machine.

Objectives

  • Understand what constitutes “Big Data” and why Hadoop is critical for processing and analyzing it
  • Describe the business value and primary use cases for Hadoop
  • Understand how Hadoop fits into your existing infrastructure and processes
  • Explore the Hadoop ecosystem through HDP’s five pillars
    • Data Management: HDFS and YARN
    • Data Access: Spark, Storm, Pig, Hive, Tez, MapReduce, HBase, Accumulo, HCatalog, Kafka, Solr, Mahout and Slider
    • Data Governance & Integration: Atlas, Falcon, Sqoop and Flume
    • Security: Knox and Ranger
    • Operations: Ambari, Oozie and ZooKeeper
  • Discuss the value of partner integrations and the Modern Data Architecture
  • Learn about the security features that span the Hadoop ecosystem
  • Share knowledge that supports decisions about how Hadoop can be used in enterprise use cases and architectures

Demos

  • Operational Overview with Ambari
  • Ingesting Data into HDFS
  • Streaming Data into HDFS
  • Data Manipulation with Hive
  • Risk Factor Analysis with Pig
  • Risk Factor Analysis with Spark (see the sketch after this list)
  • Securing Hive with Ranger
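
For a concrete flavor of the Hive and Spark demos, the sketch below shows the general shape of that kind of analysis using PySpark with Hive support. It is illustrative only: the table and column names (driver_events, driver_id, miles, risky_events) are assumptions rather than the actual demo data set, and the demo environment may use a different Spark API version.

  # Hypothetical sketch of a Hive-style risk factor query run from PySpark.
  # Table and column names are illustrative, not the actual demo data.
  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .appName("risk-factor-sketch")
           .enableHiveSupport()          # lets spark.sql() read Hive tables
           .getOrCreate())

  # Aggregate a per-driver risk factor: risky events per mile driven.
  risk = spark.sql("""
      SELECT driver_id,
             SUM(risky_events) / SUM(miles) AS risk_factor
      FROM   driver_events
      GROUP  BY driver_id
      ORDER  BY risk_factor DESC
  """)

  risk.show(10)                          # inspect the ten highest-risk drivers
  spark.stop()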


HDP Data Science

Learn data science techniques and best practices leveraging the Hadoop ecosystem and its tools in this 2-day course.

Audience

Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Apache Hadoop.

Prerequisites

Students must have experience with at least one programming or scripting language such as Python, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles.

Objectives

  • Recognize use cases for data science
  • Describe the architecture of Hadoop and YARN
  • Explain the differences between supervised and unsupervised learning
  • List the six machine learning tasks
  • Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
  • Use Mahout to run a machine learning algorithm on Hadoop
  • Write Pig scripts to transform data on Hadoop
  • Use Pig to prepare data for a machine learning algorithm
  • Write a Python script
  • Use NumPy to analyze big data
  • Use the data structure classes in the pandas library
  • Write a Python script that invokes a SciPy machine learning algorithm
  • Explain the options for running Python code on a Hadoop cluster
  • Write a Pig User Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script
  • Write a Python script that invokes a scikit-learn machine learning algorithm
  • Use the k-nearest neighbor algorithm to predict values based on a data set (see the sketch after this list)
  • Run the k-means clustering algorithm on a distributed data set on Hadoop
  • Describe use cases for Natural Language Processing (NLP)
  • Run an NLP algorithm on a Hadoop cluster
  • Run machine learning algorithms on Hadoop using Spark MLlib
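
As a minimal illustration of the k-nearest neighbor objective, the sketch below uses pandas and scikit-learn on a small synthetic data set. The column names, target formula, and model parameters are arbitrary assumptions; the course labs use their own data and tooling.

  # Minimal k-nearest-neighbor sketch with NumPy, pandas and scikit-learn.
  # The data is synthetic; column names and parameters are arbitrary.
  import numpy as np
  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsRegressor

  rng = np.random.default_rng(42)
  df = pd.DataFrame({
      "height": rng.normal(170, 10, 500),
      "weight": rng.normal(70, 12, 500),
  })
  # Toy continuous target that the model should be able to recover.
  df["target"] = 0.5 * df["height"] + 0.3 * df["weight"] + rng.normal(0, 2, 500)

  X_train, X_test, y_train, y_test = train_test_split(
      df[["height", "weight"]], df["target"], test_size=0.2, random_state=0)

  knn = KNeighborsRegressor(n_neighbors=5)   # k = 5 nearest neighbors
  knn.fit(X_train, y_train)
  print("R^2 on held-out data:", knn.score(X_test, y_test))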

Labs

  • Describe the architecture of Hadoop and YARN
  • Explain the differences between supervised and unsupervised learning
  • Recognize use cases for clustering, outlier detection, affinity analysis, classification, regression, and recommendation
  • Write Pig scripts to transform data on Hadoop
  • Use Pig to prepare data for a machine learning algorithm
  • Write a Python script using NumPy, SciPy, Matplotlib, pandas, and scikit-learn to analyze big data
  • Exercise the options for running Python code on a Hadoop cluster
  • Write a Pig User Defined Function in Python (sketched after this list)
  • Use Pig streaming on Hadoop with a Python script
  • Run a Hadoop Streaming job
  • Understand some key tasks in Natural Language Processing (NLP)
  • Run an NLP algorithm in IPython
  • Run machine learning algorithms on Hadoop using Spark MLlib
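
The Pig UDF lab revolves around a small Python file that Pig loads through its Jython engine. The sketch below is a hypothetical example of that pattern: the function name, schema, and the Pig Latin shown in the comments are illustrative assumptions rather than the lab's actual code.

  # Hypothetical Pig user-defined function written in Python and executed by
  # Pig's Jython engine.  It would be registered from Pig Latin roughly as:
  #   REGISTER 'cleanse.py' USING jython AS cleanse;
  #   clean = FOREACH raw GENERATE cleanse.normalize_state(state);

  try:
      outputSchema                 # decorator injected by Pig at load time
  except NameError:                # no-op fallback so the file also runs standalone
      def outputSchema(schema):
          return lambda func: func

  @outputSchema("state:chararray")
  def normalize_state(value):
      """Trim whitespace and upper-case a two-letter US state code."""
      if value is None:
          return None
      return value.strip().upper()[:2]

  if __name__ == "__main__":
      print(normalize_state("  tx "))   # quick local check, prints "TX"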


HDP Operations: Administration

This 2-day course is designed for administrators who will be managing the Hortonworks Data Platform (HDP) 2.3 with Ambari. It covers installation, configuration, and other typical cluster management tasks.

Audience

IT administrators and operators responsible for installing, configuring, and supporting an HDP 2.3 deployment in a Linux environment using Ambari.

Prerequisites

No previous Hadoop knowledge is required, though it will be useful. Attendees should be familiar with data center operations and Linux system administration. Students will need to bring a Wi-Fi-enabled laptop with the Chrome or Firefox browser installed in order to complete the hands-on labs.

Objectives

  • Add, Remove, Replace Cluster Nodes
  • Configure Rack Awareness
  • Configure High Availability NameNode and YARN Resource Manager
  • Manage Hadoop Services
  • Manage HDFS Storage
  • Manage YARN
  • Configure Capacity Scheduler
  • Monitor Cluster

Labs

  • Install HDP
  • Managing Ambari Users and Groups
  • Manage Hadoop Services (see the sketch after this list)
  • Using Hadoop Storage
  • Managing Hadoop Storage
  • Managing YARN Service using Ambari Web UI
  • Managing YARN Service using CLI
  • Setting Up the Capacity Scheduler
  • Managing YARN Containers and Queues
  • Managing YARN ACLs and User Limits
  • Adding, Decommissioning and Recommissioning Worker Nodes
  • Configuring Rack Awareness
  • Configuring NameNode HA
  • Configuring ResourceManager HA
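
The service management labs are driven through the Ambari web UI and CLI; the same operations can also be scripted against Ambari's REST API. The sketch below stops and restarts YARN that way. The host, port, credentials, and cluster name are assumptions for a default sandbox-style install, not values from the course environment.

  # Hypothetical sketch: stopping and starting the YARN service through the
  # Ambari REST API with the requests library.  Host, port, credentials, and
  # cluster name are assumptions for a default sandbox-style install.
  import requests

  AMBARI = "http://localhost:8080/api/v1/clusters/mycluster"
  AUTH = ("admin", "admin")
  HEADERS = {"X-Requested-By": "ambari"}   # Ambari requires this header on writes

  def set_service_state(service, state):
      """Ask Ambari to move a service to STARTED or INSTALLED (stopped)."""
      body = {
          "RequestInfo": {"context": "Set {0} to {1} via REST".format(service, state)},
          "Body": {"ServiceInfo": {"state": state}},
      }
      resp = requests.put("{0}/services/{1}".format(AMBARI, service),
                          json=body, auth=AUTH, headers=HEADERS)
      resp.raise_for_status()
      return resp.status_code

  set_service_state("YARN", "INSTALLED")   # stop YARN
  set_service_state("YARN", "STARTED")     # start it again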


HDP Developer: Spark

This 2-day course is designed for developers who need to create applications to analyze Big Data stored in Apache Hadoop using Spark. The focus is on using the Spark API from Python or Scala.

Audience

Developers, Architects, and Admins who would like to learn more about developing data applications in Spark, how it will affect their environment, and ways to optimize their applications.

Prerequisites

No previous Hadoop knowledge is required, though it will be useful. Basic knowledge of Python or Scala is required. Previous exposure to SQL is helpful, but not required. Students will need to bring a Wi-Fi-enabled laptop with the Chrome or Firefox browser installed in order to complete the hands-on labs.

Objectives

  • Describe Spark and Spark-specific use cases
  • Explain the differences between Spark and MapReduce
  • Explore data interactively through the Spark shell utility
  • Explain the RDD concept
  • Use the Python/Scala Spark APIs
  • Create all types of RDDs: Pair, Double, and Generic
  • Use RDD type-specific functions
  • Explain the interaction of the components of a Spark application
  • Explain the creation of the DAG schedule
  • Build and package Spark applications
  • Use application configuration items
  • Deploy applications to the cluster using YARN
  • Use data caching to increase performance of applications
  • Implement advanced features of Spark
  • Learn general application optimization guidelines/tips
  • Create/transform data using DataFrames
  • Read, use, and save to different Hadoop file formats
  • Understand the concepts of Spark Streaming
  • Create a streaming application
  • Use Spark MLlib to gain insights from data

Labs

  • Create a Spark “Hello World” word count application (see the sketch after this list)
  • Use advanced RDD programming to perform sort, join, pattern matching and regex tasks
  • Explore partitioning and the Spark UI
  • Increase performance using data caching
  • Build/package a Spark application using Maven
  • Use a broadcast variable to efficiently join a small dataset to a massive dataset
  • Use an accumulator for reporting data quality issues
  • Create a dataframe and perform analysis
  • Load/transform/store data using Spark with Hive tables
  • Create a point-in-time Spark Streaming application
  • Create a Spark Streaming application using window functions
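
As a point of reference for the first lab, the sketch below is a minimal PySpark word count. The HDFS input path is a placeholder, and the lab's actual data set and skeleton code will differ.

  # Minimal PySpark word count in the spirit of the first lab.
  # The HDFS input path is a placeholder; any text file will do.
  from operator import add
  from pyspark import SparkContext

  sc = SparkContext(appName="wordcount-sketch")

  counts = (sc.textFile("hdfs:///tmp/input.txt")         # placeholder input path
              .flatMap(lambda line: line.split())        # one element per word
              .map(lambda word: (word.lower(), 1))       # pair RDD of (word, 1)
              .reduceByKey(add))                         # sum the counts per word

  # Print the ten most frequent words.
  for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
      print(word, count)

  sc.stop()

Packaged or not, a script like this would typically be submitted to the cluster with spark-submit --master yarn, which is the deployment path the course objectives cover.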


HDF Operations

This 2-day course is designed for ‘Data Stewards’ or ‘Data Flow Managers’ who want to automate the flow of data between systems.

Audience

Data Engineers, Integration Engineers, and Architects who want to automate data flow between systems.

Prerequisites

Some experience with Linux and a basic understanding of dataflow tools are helpful. Students will need to bring a Wi-Fi-enabled laptop with the Chrome or Firefox browser installed in order to complete the hands-on labs.

Objectives

  • Understand what HDF and NiFi are, along with core concepts and use cases
  • Understand the NiFi architecture and key features
  • Learn the NiFi user interface in depth and how to build a dataflow
  • Understand NiFi Processors, Connections, Process Groups and Remote Process Groups
  • Get a basic overview of dataflow optimization and data provenance
  • Understand the NiFi Expression Language
  • Install and configure a NiFi cluster
  • Understand security and monitoring options for HDF
  • Integrate HDF and HDP
  • Learn HDF system and NiFi best practices

Labs

  • Building a NiFi Data Flow
  • Working With Process Groups
  • Working With Remote Process Groups [Site-to-Site]
  • NiFi Expression Language
  • Using Templates
  • Working With NiFi Cluster
  • NiFi Monitoring
  • HDF Integration with HDP [Spark, Kafka, HBase] (see the sketch after this list)
  • Securing HDF with 2-way SSL
  • NiFi User Authentication with LDAP
  • End-of-course project
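
For the HDF-to-HDP integration lab, a common starting point is pushing test events into Kafka so that a NiFi Kafka consumer processor (or a downstream Spark or HBase flow) has something to pick up. The sketch below does that with the kafka-python package; the broker address, topic name, and event shape are all assumptions, not the lab's actual setup.

  # Hypothetical sketch: publishing test events to a Kafka topic that a NiFi
  # Kafka consumer processor (or a Spark job) could read.  Broker address,
  # topic name, and event fields are assumptions; requires kafka-python.
  import json
  import time
  from kafka import KafkaProducer

  producer = KafkaProducer(
      bootstrap_servers="localhost:6667",              # HDP's default Kafka port
      value_serializer=lambda v: json.dumps(v).encode("utf-8"),
  )

  for i in range(10):
      event = {"sensor_id": i % 3, "reading": 20.0 + i, "ts": time.time()}
      producer.send("truck-events", event)             # hypothetical topic name
      time.sleep(0.1)

  producer.flush()                                      # make sure all events are sent
  producer.close()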