Accessing Data with Spark: Data Analysis Using the Spark DataFrame API


Overview/Description
Expected Duration
Lesson Objectives
Course Number
Expertise Level



Overview/Description

An open-source cluster-computing framework used for data science, Apache Spark has become the de facto big data framework. In this Skillsoft Aspire course, learners explore how to analyze real data sets using DataFrame API methods. Discover how to optimize operations with shared variables and combine data from multiple DataFrames using joins. Explore the Spark 2.x features that make it significantly faster than Spark 1.x. Other topics include creating a Spark DataFrame from a CSV file; applying DataFrame transformations, grouping, and aggregation; and performing operations on a DataFrame to analyze categories of data in a data set. Visualize the contents of a Spark DataFrame with Matplotlib. Conclude by studying how to broadcast variables and store DataFrame contents in text file format.



Expected Duration (hours)
1.2

Lesson Objectives

Accessing Data with Spark: Data Analysis Using the Spark DataFrame API

  • Course Overview
  • recognize the features that make Spark 2.x versions significantly faster than Spark 1.x
  • specify the reasons for using shared variables in your Spark application and distinguish between the two options available for sharing variables
  • create a Spark DataFrame from the contents of a CSV file and apply some simple transformations on the DataFrame
  • define a transformation to view a random sample of data from a large DataFrame
  • apply grouping and aggregation operations on a DataFrame to analyze categories of data in a dataset
  • use Matplotlib to visualize the contents of a Spark DataFrame
  • perform operations to prepare your dataset for analysis by trimming unnecessary columns and rows containing missing data
  • define and apply a generic transformation on a DataFrame
  • apply complex transformations on a DataFrame to extract meaningful information from a dataset
  • work with broadcast variables and perform a join operation with a DataFrame that has been broadcast
  • use a Spark accumulator as a counter
  • store the contents of a DataFrame in a text file for archiving or sharing
  • define and work with a custom accumulator to count a vector of values
  • perform different join operations on Spark DataFrames to combine data from multiple sources
  • analyze data using the DataFrame API
Course Number
it_dsadskdj_02_enus

Expertise Level
Beginner