Data Flow for the Hadoop Ecosystem





Overview/Description
Hadoop is a Java-based framework for running applications on large clusters of commodity hardware. It incorporates features similar to those of the Google File System (GFS) and the MapReduce computing paradigm. You'll explore a demonstration of how to use Sqoop and Hive with Hadoop to flow and fuse data. The demonstration covers preprocessing, partitioning, and joining data. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.

Target Audience
Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis.

Prerequisites
None

Expected Duration (hours)
1.9

Lesson Objectives

Data Flow for the Hadoop Ecosystem

  • start the course
  • describe data life cycle management
  • recall the parameters that must be set in the Sqoop import statement
  • create a table and load data into MySQL
  • use Sqoop to import data into Hive
  • recall the parameters that must be set in the Sqoop export statement
  • use Sqoop to export data from Hive
  • recall the three most common date datatypes and which systems support each
  • use casting to import datetime stamps into Hive
  • export datetime stamps from Hive into MySQL
  • describe dirty data and how it should be preprocessed
  • use Hive to create tables outside the warehouse
  • use Pig to sample data
  • recall some other popular components for the Hadoop Ecosystem
  • recall some best practices for pseudo-mode implementation
  • write custom scripts to assist with administrative tasks
  • troubleshoot classpath errors
  • create complex configuration files
  • use Sqoop and Hive for data flow and fusion in the Hadoop ecosystem

Course Number
df_ahec_a10_it_enus

Expertise Level
Intermediate