Final Exam: Data Wrangler


Overview/Description
Expected Duration
Lesson Objectives
Course Number
Expertise Level



Overview/Description

Final Exam: Data Wrangler will test your knowledge and application of the topics presented throughout the Data Wrangler track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.



Expected Duration (hours)
0.0

Lesson Objectives

Final Exam: Data Wrangler

  • apply a group by transformation to aggregate with a conditional value
  • apply grouping and aggregation operations on a DataFrame to analyze categories of data in a dataset (see the pandas sketch after this list)
  • build and run the application and confirm the output using HDFS from both the command line and the web application
  • change column values by applying functions
  • change date formats to the ISO 8601 standard
  • code up a Combiner for the MapReduce application and configure the Driver to use it for a partial reduction on the Mapper nodes of the cluster
  • compare managed and external tables in Hive and how they relate to the underlying data
  • configure and test PyMongo in a Python program
  • configure the Reducer and the Driver for the inverted index application
  • create and analyze categories of data in a dataset using windows
  • create and configure Pandas DataFrame objects
  • create and configure Pandas Series objects
  • create and instantiate a directed acyclic graph in Airflow
  • create a Spark DataFrame from the contents of a CSV file and apply some simple transformations on the DataFrame (see the PySpark sketch after this list)
  • create the driver program for the MapReduce application
  • define and run a join query involving two related tables
  • define a vehicle type that can be used to represent automobiles to be stored in a Java PriorityQueue
  • define the Mapper for a MapReduce application to build an inverted index from a set of text files
  • define what a window is in the context of Spark DataFrames and when windows can be used
  • demonstrate how to ingest data using Sqoop
  • describe data ingestion approaches and compare Avro and Parquet file format benefits
  • describe the benefits that can be achieved using serverless and lambda architectures
  • describe the data processing strategies provided by MapReduce V2, Hive, Pig, and YARN for processing data in data lakes
  • describe the different primitive and complex data types available in Hive
  • extract subsets of data using filtering
  • flatten multi-dimensional data structures by chaining lateral views
  • handle common errors encountered when reading CSV data
  • identify and troubleshoot missing data
  • identify and work with time-series data
  • identify kinds of masking operations
  • implement a multi-stage aggregation pipeline
  • implement data lakes using AWS
  • implement deep learning using Keras
  • install MongoDB and implement data partitioning using MongoDB
  • list the prominent distributed data models along with their associated implementation benefits
  • list the various frameworks that can be used to process data from data lakes
  • load a few rows of data into a table and query it with simple select statements
  • load multiple sheets from an Excel document
  • perform create, read, update, and delete operations on a MongoDB document (see the PyMongo sketch after this list)
  • perform statistical operations on DataFrames
  • plot pie charts, box plots, and scatter plots using Pandas
  • recall the prominent data pattern implementations in microservices
  • recognize the capabilities of Microsoft machine learning tools
  • recognize the machine learning tools provided by AWS for data analysis
  • recognize the read and write optimizations in MongoDB
  • set up and install Apache Airflow
  • split columns based on a pattern
  • test Airflow tasks using the airflow command line utility
  • trim and clean a DataFrame before a view is created as a precursor to running SQL queries on it
  • use a regular expression to extract data into a new column
  • use a Spark accumulator as a counter
  • use createIndex to build an index on a collection
  • use Maven to create a new project for a MapReduce application and plan out the Map and Reduce phases by examining the auto prices dataset
  • use the alter table statement to change the definition of a Hive table
  • use the find operation to select documents from a collection
  • use the mongoexport tool to export data from MongoDB to JSON and CSV
  • use the mongoimport tool to import from JSON and CSV
  • use the UNION and UNION ALL operations on table data and distinguish between the two
  • work with data in the form of key-value pairs (map data structures) in Hive
  • work with scikit-learn to implement machine learning
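
The pandas grouping and aggregation objectives above can be illustrated with a minimal sketch. The example below builds a small, hypothetical sales DataFrame (the column names and the threshold of 100 are assumptions made for illustration, not values from the course), runs a plain group-by aggregation, and then aggregates with a conditional value.

    import pandas as pd

    # Small, hypothetical dataset used only for illustration.
    df = pd.DataFrame({
        "region":  ["East", "East", "West", "West", "West"],
        "product": ["A", "B", "A", "B", "B"],
        "revenue": [120.0, 80.0, 200.0, 50.0, 75.0],
    })

    # Plain group-by aggregation: total and average revenue per region.
    summary = df.groupby("region")["revenue"].agg(["sum", "mean"])
    print(summary)

    # Group by with a conditional value: count the rows in each region
    # whose revenue exceeds an assumed threshold of 100.
    high_value = (df.assign(over_100=df["revenue"] > 100)
                    .groupby("region")["over_100"]
                    .sum())
    print(high_value)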
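
For the Spark DataFrame objectives, the sketch below shows one way to load a CSV file into a DataFrame and apply a couple of simple transformations with PySpark. It assumes a local Spark installation and a file named auto_prices.csv with a numeric price column; the file and column names are assumptions chosen to echo the auto prices dataset mentioned above.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-demo").getOrCreate()

    # Read the CSV file, inferring column types from the data.
    df = spark.read.csv("auto_prices.csv", header=True, inferSchema=True)

    # Simple transformations: filter rows and derive a new column.
    expensive = (df.filter(F.col("price") > 20000)
                   .withColumn("price_thousands", F.col("price") / 1000))

    expensive.show(5)
    spark.stop()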
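
For the MongoDB objectives, the sketch below walks through basic create, read, update, and delete operations and index creation using PyMongo. It assumes a MongoDB server on localhost; the database, collection, field names, and values are illustrative only. Note that the shell's createIndex corresponds to create_index in PyMongo.

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    cars = client["demo"]["cars"]

    # Create: insert a document.
    cars.insert_one({"make": "Toyota", "model": "Corolla", "price": 21000})

    # Read: find documents matching a filter.
    for doc in cars.find({"price": {"$lt": 25000}}):
        print(doc)

    # Update: modify the first matching document.
    cars.update_one({"model": "Corolla"}, {"$set": {"price": 20500}})

    # Delete: remove the first matching document.
    cars.delete_one({"model": "Corolla"})

    # Index: build an ascending index on the "make" field.
    cars.create_index([("make", ASCENDING)])

    client.close()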

Course Number
it_fedads_02_enus

Expertise Level
Intermediate