Final Exam: Data Analyst will test your knowledge and application of the topics presented throughout the Data Analyst track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.
build and run the application and confirm the output using HDFS from both the command line and the web application
compare and contrast SQL and NoSQL database solutions
configure a JDBC connection on Glue to the Redshift cluster
configure and view permissions for individual files and directories using the getfacl and chmod commands
configure HDFS using the hdfs-site.xml file and identify the properties that can be set in it
crawl data stored in a DynamoDB table
create and configure a Hadoop cluster on the Google Cloud Platform using its Cloud Dataproc service
create and configure simple graphs with lines and markers using the Matplotlib data visualization library
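A minimal sketch of a line-and-marker plot in Matplotlib (the sales figures are hypothetical sample data):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe without a display
import matplotlib.pyplot as plt

# hypothetical quarterly sales data
quarters = [1, 2, 3, 4]
sales = [10, 14, 9, 17]

fig, ax = plt.subplots()
# a solid line with circular markers; linestyle and marker control the look
ax.plot(quarters, sales, linestyle="-", marker="o", label="sales")
ax.set_xlabel("Quarter")
ax.set_ylabel("Units sold")
ax.legend()
fig.savefig("sales.png")
```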
create and load data into an RDD
create data frames in R
create matrices in R
create vectors in R
define linear regression
define the contents of a DataFrame using the SQLContext
define the inter-quartile range of a dataset and enumerate its properties
define the mean of a dataset and enumerate its properties
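The mean and inter-quartile range can be sketched with NumPy (the data values are hypothetical):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()                      # arithmetic mean: sum / count
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1                           # spread of the middle 50% of values
```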
delete a Google Cloud Dataproc cluster and all of its associated resources
deploy DynamoDB in the Amazon Web Services cloud
describe and apply the different techniques involved in handling datasets where some information is missing
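Three common techniques for missing data, sketched with Pandas on a hypothetical dataset:

```python
import numpy as np
import pandas as pd

# hypothetical dataset with gaps (NaN marks missing values)
df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "score": [88.0, 92.0, np.nan, 75.0]})

dropped = df.dropna()          # discard any row with a missing value
filled = df.fillna(df.mean())  # impute missing values with the column mean
interp = df.interpolate()      # linear interpolation between neighbors
```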
describe NoSQL Stores and how they are used
describe the concept of a hierarchical index, or multi-index, and why it can be useful
describe the ETL process and different tools available
describe the options available when iterating over 1-dimensional and multi-dimensional arrays
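A short sketch of the two main iteration options in NumPy (the array contents are arbitrary):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# iterating a 2-D array directly yields one 1-D row per step
rows = [row for row in a]

# np.nditer walks every element regardless of dimensionality
flat = [int(x) for x in np.nditer(a)]
```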
draw the shape of a Gaussian distribution and enumerate its defining properties
edit individual cells and entire rows and columns in a Pandas DataFrame
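Cell, row, and column edits in a Pandas DataFrame can be sketched as follows (the names and scores are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [80, 90]})

df.at[0, "score"] = 85            # edit a single cell by row label and column
df.loc[1] = ["Rob", 95]           # replace an entire row
df["passed"] = df["score"] >= 85  # add or overwrite an entire column
```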
execute the application and verify that the filtering has worked correctly; examine the job and the output files using the YARN Cluster Manager and HDFS NameNode web UIs
explain the concept of a hierarchical index, or multi-index, and why it can be useful
export the contents of a DataFrame into files of various formats
identify different tools available for data management
identify the various GCP services used by Dataproc when provisioning a cluster
import and export data in R
initialize a Spark DataFrame from the contents of an RDD
install Pandas and create a Pandas Series
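After installing Pandas (e.g. `pip install pandas`), creating a Series is a one-liner; the values and index labels here are illustrative:

```python
import pandas as pd

# a Series is a labeled 1-D array; the index gives each value a label
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
```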
list the six phases of the data lifecycle
load data into a Redshift cluster from S3 buckets
read data from an Excel spreadsheet
read data from files and write data to files using the Python Pandas library
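Reading and writing files with Pandas can be sketched with an in-memory buffer standing in for a file on disk (the CSV content is hypothetical):

```python
import io
import pandas as pd

csv_text = "city,pop\nOslo,700000\nBergen,280000\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# to_csv writes the frame back out; index=False omits the row labels
buf = io.StringIO()
df.to_csv(buf, index=False)
```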
recall how Apache Zookeeper enables the HDFS NameNode and YARN ResourceManager to run in high-availability mode
recall the steps involved in building a MapReduce application and the specific workings of the Map phase in processing each row of data in the input file
recognize and deal with missing data in R
recognize the challenges involved in processing big data and the options available to address them such as vertical and horizontal scaling
retrieve specific parts of an array using row and column indices
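Row and column indexing in NumPy, sketched on a small hypothetical matrix:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)

cell = m[1, 2]       # single element: row 1, column 2
row = m[0]           # entire first row
col = m[:, 3]        # entire last column
block = m[0:2, 1:3]  # 2x2 sub-array by row and column slices
```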
run ETL scripts using Glue
run the application and examine the outputs generated to get the word frequencies in the input text document
set up a JDBC connection on Glue to the Redshift cluster
specify the configurations of the MapReduce applications in the Driver program and the project's pom.xml file
standardize a distribution to express its values as z-scores and use Pandas to generate a correlation and covariance matrix for your dataset
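Standardization and the correlation/covariance matrices can be sketched with Pandas (the height and weight values are invented):

```python
import pandas as pd

df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0],
                   "weight": [50.0, 60.0, 65.0, 80.0]})

# z-scores: subtract the column mean, divide by the (sample) std deviation
z = (df - df.mean()) / df.std()

corr = df.corr()  # pairwise Pearson correlations
cov = df.cov()    # pairwise sample covariances
```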
transfer files from your local file system to HDFS using the copyFromLocal command
use fancy indexing with arrays using an index mask
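Fancy indexing and boolean index masks, sketched on a hypothetical array:

```python
import numpy as np

a = np.array([3, 8, 1, 9, 4])

idx = np.array([0, 2, 4])  # fancy indexing: pick elements by position array
picked = a[idx]

mask = a > 4               # boolean index mask, one True/False per element
big = a[mask]
```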
use NumPy to compute statistics such as the mean and median on your data
use NumPy to compute the correlation and covariance of two distributions and visualize their relationship with scatterplots
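The NumPy statistics in the two objectives above can be sketched together (x and y are hypothetical, with y roughly 2x):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly 2 * x

mean_x = np.mean(x)
median_x = np.median(x)

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient of x and y
c = np.cov(x, y)[0, 1]       # sample covariance of x and y
```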
use the dplyr library to load data frames
use the get and getmerge functions to retrieve one or multiple files from HDFS
use the ggplot2 library to visualize data using R
use the NumPy library to manipulate arrays and the Pandas library to load and analyze a dataset
compare two independent samples using the independent t-test and a related sample using a paired t-test with the SciPy library
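Both t-tests can be sketched with SciPy on synthetic data (the group means and sample sizes are arbitrary assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)
group_b = rng.normal(loc=5.5, scale=1.0, size=50)

# independent two-sample t-test: do two unrelated groups differ in mean?
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

# paired t-test: same subjects measured before and after an intervention
before = group_a
after = group_a + rng.normal(loc=0.3, scale=0.2, size=50)
t_rel, p_rel = stats.ttest_rel(before, after)
```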
add or modify data frame columns using the mutate method
work with the YARN Cluster Manager and HDFS NameNode web applications that come packaged with Hadoop