Hadoop with Spark

Course Description

Course Overview:

The Hadoop Development, Administration and BI Program is a one-stop course that introduces you to Hadoop development and gives you the technical know-how to work with it. Hadoop is one of the most widely used Big Data processing frameworks. Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances. On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.

Course Objectives:

Learn the basics of Big Data and Spark
Play with Hadoop and the Hadoop ecosystem
Become a top-notch Spark Developer

Pre-requisites:

Professionals with basic knowledge of software development, programming languages, and databases will find this course helpful. Basic knowledge is enough to succeed in this course.
Not For: Students who are absolute beginners in software development as a discipline will find it difficult to follow the course.

Target Audience:

Developers
Data Analysts

Course Duration:

35 hours – 3 days

Course Content:

Phase 1: Hadoop Fundamentals with single node setup (Day 1)

Laying the foundation

Introduction to Hadoop and Spark

Ecosystem
Big Data Overview
Key Roles in Big Data Project
Key Business Use cases
Hadoop and Spark Logical Architecture
Typical Big Data Project Pipeline

Basic Concepts of HDFS

HDFS Overview
Physical Architectures of HDFS
Hadoop Distributed File System (HDFS) Hands-on

Hadoop Ecosystem

Sqoop
Hive

Introduction to Sqoop

What is Sqoop
Importing data: import and import-all-tables
Sqoop job, eval, and codegen
List databases/tables

Hadoop Hands On

Running HDFS commands
Running Sqoop Import and Sqoop Export

Introduction to Spark

Spark Overview
Detailed discussion on “Why Spark”
Quick Recap of MapReduce
Spark vs MapReduce
Why Python for Spark
Just Enough Python for Spark
Understanding of CDH Spark and Apache Spark

Phase 2: Hadoop Development (Day 2)

Become a pro developer with Spark and the Hive data warehouse

Spark Core Framework and API

High level Spark Architecture
Role of Executor, Driver, SparkSession etc.
Resilient Distributed Datasets
Basic operations in the Spark Core API, i.e. Actions and Transformations
Using the Spark REPL for performing interactive data analysis
Hands-on Exercises
Integrating with Hive

Delving Deeper into Spark API

Pair RDDs
Implementing Map Reduce Algorithms using Spark
Ways to create Pair RDDs
JSON Processing
Code Example on JSON Processing
XML Processing
Joins
Playing with Regular Expressions
Log File Processing using Regular Expressions
Hands-on Exercises

Executing a Spark Application

Writing Standalone Spark Application
Various commands to execute and configure Spark Applications in various modes
Discussion on Application, Job, Stage, Executor, Tasks
Interpreting RDD Metadata/Lineage/DAG
Controlling degree of Parallelism in Spark Job
Physical execution of a Spark application
Discussion on: How Spark is better than MapReduce?
Hands-on Exercises

Phase 3: Hadoop with Spark Dataframes and Spark SQL (Day 3 and 4)

Spark SQL

Dataframes in Depth
Creating Dataframes
Discussion on different file formats: ORC, Sequence, Avro, and Parquet
Dataframe internals that make it fast: Catalyst Optimizer and Tungsten
Load data into Spark from external data sources like relational databases
Saving dataframe to external sources like HDFS, RDBMS
SQL features of Dataframes
Data formats: text formats such as CSV, JSON, and XML; binary formats such as Parquet and ORC
UDFs in Spark Dataframes
When to use Hive UDFs and when not to
CDC use cases
Spark optimization techniques: joins
Integration with Teradata: use case