Data Refinery with YARN and MapReduce





Overview/Description
The core of Hadoop consists of a storage part, HDFS, and a processing part, MapReduce. Hadoop splits files into large blocks and distributes the blocks among the nodes in the cluster. To process the data, Hadoop and MapReduce transfer code to the nodes that hold the required data, and those nodes then process it in parallel. This approach takes advantage of data locality, so the data can be processed faster and more efficiently than with a more conventional supercomputer architecture, which relies on a parallel file system where computation and data are connected via high-speed networking. In this course, you'll learn about the theory of YARN as a parallel processing framework for Hadoop. You'll also learn about the theory of MapReduce as the backbone of parallel processing jobs. Finally, this course demonstrates MapReduce in action by explaining the pertinent classes and then walking through a MapReduce program step by step. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
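
To make the map, shuffle, and reduce steps concrete, here is a minimal in-memory sketch of the word-count pattern in plain Python. It is not Hadoop code and does not use any Hadoop API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen for illustration only, and a real cluster would run the map and reduce calls in parallel on the nodes that hold the data blocks.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the list of counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "The lazy dog"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Each phase consumes and produces key-value pairs, which is the property that lets a framework split the work across many nodes.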

Target Audience
Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop Engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis.

Prerequisites
None

Expected Duration (hours)
1.6

Lesson Objectives

Data Refinery with YARN and MapReduce

  • start the course
  • describe parallel processing in the context of supercomputing
  • list the components of YARN and identify their primary functions
  • diagram YARN Resource Manager and identify its key components
  • diagram YARN Node Manager and identify its key components
  • diagram YARN ApplicationMaster and identify its key components
  • describe the operations of YARN
  • identify the standard configuration parameters to be changed for YARN
  • define the principal concepts of key-value pairs and list the rules for key-value pairs
  • describe how MapReduce transforms key-value pairs
  • load a large textbook and then run WordCount to count the number of words in the textbook
  • label all of the functions for MapReduce on a diagram
  • match the phases of MapReduce to their definitions
  • set up the classpath and test WordCount
  • build a JAR file and run WordCount
  • describe the base Mapper class of the MapReduce Java API and describe how to override its methods
  • describe the base Reducer class of the MapReduce Java API and describe how to override its methods
  • describe the function of the MapReduceDriver Java class
  • set up the classpath and test a MapReduce job
  • identify the concept of streaming for MapReduce
  • stream a Python job
  • understand YARN features and components, as well as MapReduce and its classes

Course Number
df_ahec_a06_it_enus

Expertise Level
Intermediate