Data Refinery with YARN and MapReduce

Data Refinery with YARN and MapReduce

Overview/Description
Target Audience
Prerequisites
Expected Duration
Lesson Objectives
Course Number
Expertise Level

Overview/Description
The core of Hadoop consists of a storage part, HDFS, and a processing part, MapReduce. Hadoop splits files into large blocks and distributes the blocks amongst the nodes in the cluster. To process the data, Hadoop and MapReduce transfer code to nodes that have the required data, which the nodes then process in parallel. This approach takes advantage of data locality to allow the data to be processed faster and more efficiently via distributed processing than by using a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking. In this course, you'll learn about the theory of YARN as a parallel processing framework for Hadoop. You'll also learn about the theory of MapReduce as the backbone of parallel processing jobs. Finally, this course demonstrates MapReduce in action by explaining the pertinent classes and then walk through a MapReduce program step by step. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.

Target Audience
Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop Engineering team in roles such as Hadoop developer, data architect, or data engineer or roles related to technical project management, cluster operations, or data analysis

Prerequisites
None

Expected Duration (hours)
1.6

Lesson Objectives

Data Refinery with YARN and MapReduce

start the course

describe parallel processing in the context of supercomputing

list the components of YARN and identify their primary functions

diagram YARN Resource Manager and identify its key components

diagram YARN Node Manager and identify its key components

diagram YARN ApplicationMaster and identify its key components

describe the operations of YARN

identify the standard configuration parameters to be changed for YARN

define the principle concepts of key-value pairs and list the rules for key-value pairs

describe how MapReduce transforms key-value pairs

load a large text book and then run WordCount to count the number of words in the text book

label all of the functions for MapReduce on a diagram

match the phases of MapReduce to their definitions

set up the classpath and test WordCount

build a JAR file and run WordCount

describe the base mapper class of the MapReduce Java API and describe how to override its methods

describe the base Reducer class of the MapReduce Java API and describe how to override its methods

describe the function of the MapReduceDriver Java class

set up the classpath and test a MapReduce job

identify the concept of streaming for MapReduce

stream a Python job

understand YARN features and components, as well as MapReduce and its classes

Course Number:
df_ahec_a06_it_enus

Expertise Level
Intermediate