Hadoop HDFS: Introduction


Overview/Description

Explore the concepts of analyzing large data sets in this 12-video Skillsoft Aspire course, which covers Hadoop and the Hadoop Distributed File System (HDFS), which together enable efficient parallel processing of big data across a distributed cluster. The course assumes a conceptual understanding of Hadoop and its components; it is purely theoretical and contains no labs, providing just enough information to understand how Hadoop and HDFS process big data in parallel. The course opens by explaining vertical and horizontal scaling, then discusses the functions Hadoop serves to horizontally scale data processing tasks. Learners explore the roles of YARN, MapReduce, and HDFS, covering how HDFS keeps track of where the pieces of large files are distributed, how data is replicated, and how HDFS works with Apache Zookeeper, a tool maintained by the Apache Software Foundation that provides coordination and synchronization in distributed systems, along with other services related to distributed computing, such as a naming service and configuration management. Finally, learn about Spark, a data analytics engine for distributed data processing.
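
To ground the HDFS concepts described above, the following is a minimal, illustrative sketch (the course itself contains no labs) of writing a file to HDFS through Hadoop's Java client API. The NameNode URI and the file path here are assumptions made for the example; in a real deployment, the NameNode decides where the file's blocks and their replicas are placed.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
          Path file = new Path("/user/demo/greeting.txt");

          // The client streams bytes; HDFS splits them into blocks and
          // replicates each block across DataNodes in the cluster.
          try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, distributed world"
                .getBytes(StandardCharsets.UTF_8));
          }

          // The NameNode tracks file metadata, including replication factor.
          FileStatus status = fs.getFileStatus(file);
          System.out.println("size=" + status.getLen()
              + " replication=" + status.getReplication());
        }
      }
    }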



Expected Duration (hours)
1.2

Lesson Objectives

Hadoop HDFS: Introduction

  • Course Overview
  • recognize the need to process massive datasets at scale
  • describe the benefits of horizontal scaling for processing big data and the challenges of this approach
  • recall the features of a distributed cluster which address the challenges of horizontal scaling
  • identify the features of HDFS which enable large datasets to be distributed across a cluster
  • describe the simple and high-availability architectures of HDFS and how each is implemented
  • identify the role of Hadoop's MapReduce in processing chunks of big datasets in parallel
  • recognize the role of the YARN resource negotiator in enabling Map and Reduce operations to execute on a cluster
  • describe the steps involved in resource allocation and job execution for operations on a Hadoop cluster
  • recall how Apache Zookeeper enables the HDFS NameNode and YARN ResourceManager to run in high-availability mode
  • identify various technologies which integrate with Hadoop and simplify the task of big data processing
  • recognize the key features of distributed clusters, HDFS, and the inputs and outputs of the Map and Reduce phases (illustrated by the sketch after this list)
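
As a concrete reference for the Map and Reduce phases named in the objectives, here is the classic word-count pattern sketched with Hadoop's Java MapReduce API. The class name and the input/output paths passed as arguments are illustrative assumptions. The Map phase emits a (word, 1) pair for every token; the framework shuffles all pairs sharing a key to one reducer, which sums them into a final (word, total) pair.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: each call receives one line of input and emits (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: all counts for the same word arrive at one reducer,
      // which sums them into a single (word, total) output pair.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }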

Course Number
it_dshdfsdj_01_enus

Expertise Level
Beginner