Spark Core


Overview/Description
Target Audience
Prerequisites
Expected Duration
Lesson Objectives
Course Number
Expertise Level



Overview/Description
Spark Core provides basic I/O functionality, distributed task dispatching, and scheduling. Resilient Distributed Datasets (RDDs) are logical collections of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems or by applying transformations to existing RDDs. In this course, you will learn how to improve Spark's performance and work with DataFrames and Spark SQL.
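
As a minimal illustration of the two creation paths described above (the file path and sample data are hypothetical, and a local master is assumed), an RDD can be built by referencing a dataset in external storage and then transformed into new RDDs:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCreationSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("RddCreationSketch").setMaster("local[*]"))

        // RDD created by referencing a dataset in external storage (path is hypothetical)
        val lines = sc.textFile("hdfs:///data/input.txt")

        // New RDDs created by applying transformations to an existing RDD
        val wordCounts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Transformations are lazily evaluated; nothing runs until an action such as count()
        println(wordCounts.count())

        sc.stop()
      }
    }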

Target Audience
Programmers and developers familiar with Apache Spark who wish to expand their skill sets

Prerequisites
None

Expected Duration (hours)
2.1

Lesson Objectives

Spark Core

  • start the course
  • recall what is included in the Spark Stack
  • define lazy evaluation as it relates to Spark
  • recall that an RDD is an interface comprising a set of partitions, a list of dependencies, and functions to compute the partitions
  • pre-partition an RDD for performance
  • store RDDs in serialized form (illustrated in the sketch following this list)
  • perform numeric operations on RDDs
  • create custom accumulators
  • use broadcast functionality for optimization
  • pipe to external applications
  • adjust garbage collection settings
  • perform batch import on a Spark cluster
  • determine memory consumption
  • tune data structures to reduce memory consumption
  • use Spark's different shuffle operations to minimize memory usage of reduce tasks
  • set the levels of parallelism for each operation
  • create DataFrames
  • interoperate with RDDs
  • describe the generic load and save functions
  • read and write Parquet files
  • use a JSON dataset as a DataFrame
  • read and write data in Hive tables
  • read and write data using JDBC
  • run the Thrift JDBC/ODBC server
  • show the different ways to tune Spark for better performance
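
For illustration, here is a brief sketch of several of the Spark Core techniques named in the objectives above: pre-partitioning, serialized persistence, broadcast variables, and accumulators. It assumes Spark 2.x APIs; the key-value data and lookup table are hypothetical.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.{SparkConf, SparkContext}

    object CoreTuningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("CoreTuningSketch").setMaster("local[*]"))

        // Hypothetical key-value data
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        // Pre-partition so later key-based operations on this RDD avoid a shuffle
        val partitioned = pairs.partitionBy(new HashPartitioner(4))

        // Store the RDD in serialized form to reduce memory consumption
        partitioned.persist(StorageLevel.MEMORY_ONLY_SER)

        // Broadcast a small lookup table instead of shipping it with every task
        val lookup = sc.broadcast(Map("a" -> "alpha", "b" -> "beta"))

        // Built-in long accumulator for counting records across tasks
        val records = sc.longAccumulator("records")

        val resolved = partitioned.map { case (k, v) =>
          records.add(1)
          (lookup.value.getOrElse(k, "unknown"), v)
        }

        println(resolved.collect().mkString(", "))
        println(s"records processed: ${records.value}")

        sc.stop()
      }
    }

The accumulator shown is a built-in one; writing custom accumulators is covered in the course itself.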
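
On the DataFrame and Spark SQL side of the objectives, a similarly hedged sketch, again assuming Spark 2.x (file paths, column names, and the JSON schema are hypothetical):

    import org.apache.spark.sql.SparkSession

    object DataFrameSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("DataFrameSketch")
          .master("local[*]")
          .getOrCreate()

        // Read a JSON dataset as a DataFrame (path and schema are hypothetical)
        val people = spark.read.json("hdfs:///data/people.json")

        // Interoperate with RDDs: a DataFrame exposes its rows as an RDD
        val names = people.rdd.map(row => row.getAs[String]("name"))
        names.take(5).foreach(println)

        // Generic load/save functions, here writing and re-reading Parquet
        people.write.format("parquet").save("hdfs:///data/people.parquet")
        val reloaded = spark.read.parquet("hdfs:///data/people.parquet")

        // Register a temporary view and query it with Spark SQL
        reloaded.createOrReplaceTempView("people")
        spark.sql("SELECT name FROM people WHERE age > 21").show()

        spark.stop()
      }
    }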

Course Number
df_apsa_a01_it_enus

Expertise Level
Intermediate