Course Overview:
The Hadoop Development, Administration and BI Program is a one-stop course that introduces you to the domain of Hadoop development and gives you hands-on technical know-how of Hadoop, the most popular Big Data processing framework. Spark is a general-purpose distributed data processing engine suitable for use in a wide range of circumstances. On top of the Spark core data processing engine sit libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.
Course Objectives:
- Learn the basics of Big Data and Spark
- Play with Hadoop and Hadoop Ecosystem
- Become a top-notch Spark Developer
Pre-requisites:
- Professionals with basic knowledge of software development, programming languages, and databases will find this course helpful; that basic knowledge is enough to succeed in the course
- Not for: absolute beginners in software development, who will find it difficult to follow the course
Target Audience:
- Developers
- Data Analysts
Course Duration:
- 35 hours – 5 days
Course Content:
Phase 1: Hadoop Fundamentals with a single-node setup (Day 1)
Laying the foundation
Introduction to Hadoop and Spark
- Ecosystem
- Big Data Overview
- Key Roles in Big Data Project
- Key Business Use cases
- Hadoop and Spark Logical Architecture
- Typical Big Data Project Pipeline
Basic Concepts of HDFS
- HDFS Overview
- Physical Architectures of HDFS
- Hands-on with the Hadoop Distributed File System
Hadoop Ecosystem
- Sqoop
- Hive
Introduction to Sqoop
- What is Sqoop?
- Importing data: import / import-all-tables
- Sqoop Job/Eval and Sqoop Code-gen
- List databases/tables
Hadoop Hands On
- Running HDFS commands
- Running Sqoop Import and Sqoop Export
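As a flavour of this hands-on work, here is a minimal sketch (in Python, the course's language of choice) that drives the hdfs and sqoop command-line tools; the JDBC URL, credentials, tables, and paths are illustrative placeholders only.

    import subprocess

    def run(cmd):
        """Run a shell command and raise if it fails."""
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Basic HDFS commands: make a directory, upload a file, list it
    run(["hdfs", "dfs", "-mkdir", "-p", "/user/training/input"])
    run(["hdfs", "dfs", "-put", "localfile.txt", "/user/training/input/"])
    run(["hdfs", "dfs", "-ls", "/user/training/input"])

    # Sqoop import: copy a relational table into HDFS
    # (connection details below are placeholders)
    run(["sqoop", "import",
         "--connect", "jdbc:mysql://dbhost/retail_db",
         "--username", "student", "--password", "secret",
         "--table", "orders",
         "--target-dir", "/user/training/orders"])

    # Sqoop export: push HDFS results back to the database
    run(["sqoop", "export",
         "--connect", "jdbc:mysql://dbhost/retail_db",
         "--username", "student", "--password", "secret",
         "--table", "order_totals",
         "--export-dir", "/user/training/order_totals"])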
Introduction to Spark
- Spark Overview
- Detailed discussion on “Why Spark”
- Quick Recap of MapReduce
- Spark vs MapReduce
- Why Python for Spark
- Just Enough Python for Spark
- Understanding of CDH Spark and Apache Spark
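To make the overview concrete, below is a minimal PySpark sketch of the kind written throughout the course: it starts a SparkSession and runs a first transformation and action on a small dataset (names and values are illustrative).

    from pyspark.sql import SparkSession

    # Entry point for a Spark application
    spark = SparkSession.builder.appName("FirstSparkApp").getOrCreate()

    # Distribute a small Python list as an RDD
    numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

    # Transformation (lazy) followed by an action (triggers execution)
    squares = numbers.map(lambda x: x * x)
    print(squares.collect())  # [1, 4, 9, 16, 25]

    spark.stop()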
Phase 2: Hadoop Development (Day 2)
Become a pro developer with Spark and the Hive data warehouse
Spark Core Framework and API
- High-level Spark Architecture
- Role of the Executor, Driver, SparkSession, etc.
- Resilient Distributed Datasets
- Basic operations in Spark Core API, i.e. Actions and Transformations
- Using the Spark REPL for performing interactive data analysis
- Hands-on Exercises
- Integrating with Hive
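A small illustrative sketch of core-API usage as practised in the REPL exercises: transformations build a lineage lazily, and nothing runs until an action is called (the file path is a placeholder).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CoreAPI").getOrCreate()
    sc = spark.sparkContext

    # Transformations are lazy: they only describe the computation
    lines = sc.textFile("hdfs:///user/training/input/localfile.txt")
    long_lines = lines.filter(lambda line: len(line) > 20)
    upper = long_lines.map(lambda line: line.upper())

    # Actions trigger the actual distributed execution
    print(upper.count())  # number of long lines
    print(upper.take(3))  # first three results

    spark.stop()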
Delving Deeper into Spark API
- Pair RDDs
- Implementing MapReduce Algorithms using Spark
- Ways to create Pair RDDs
- JSON Processing: code example
- XML Processing
- Joins
- Playing with Regular Expressions
- Log File Processing using Regular Expressions
- Hands-on Exercises
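For example, the classic MapReduce word count becomes a few lines with pair RDDs; the log-file pattern shown is a deliberately simplified, illustrative regular expression, and the paths are placeholders.

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PairRDDs").getOrCreate()
    sc = spark.sparkContext

    # Word count: the canonical MapReduce algorithm with pair RDDs
    words = sc.textFile("hdfs:///user/training/input/book.txt") \
              .flatMap(lambda line: line.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.take(5))

    # Log processing: extract (ip, 1) pairs with a simplified regex
    log_pattern = re.compile(r"^(\d+\.\d+\.\d+\.\d+)")
    ips = sc.textFile("hdfs:///user/training/logs/access.log") \
            .flatMap(lambda line: log_pattern.findall(line))
    hits_per_ip = ips.map(lambda ip: (ip, 1)).reduceByKey(lambda a, b: a + b)
    print(hits_per_ip.take(5))

    spark.stop()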
Executing a Spark Application
- Writing Standalone Spark Application
- Various commands to execute and configure Spark Applications in various modes
- Discussion on Application, Job, Stage, Executor, Tasks
- Interpreting RDD Metadata/Lineage/DAG
- Controlling degree of Parallelism in Spark Job
- Physical execution of a Spark application
- Discussion on: How is Spark better than MapReduce?
- Hands-on Exercises
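A sketch of what a standalone application looks like, together with an illustrative spark-submit invocation; the app name, paths, executor settings, and partition count are example values chosen for the discussion of parallelism and lineage above.

    # wordcount_app.py -- a standalone Spark application
    import sys
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("WordCountApp").getOrCreate()
        sc = spark.sparkContext

        # Explicitly request 8 partitions to control parallelism
        data = sc.textFile(sys.argv[1], minPartitions=8)
        counts = (data.flatMap(lambda line: line.split())
                      .map(lambda w: (w, 1))
                      .reduceByKey(lambda a, b: a + b))

        # toDebugString shows the RDD lineage / DAG discussed in class
        print(counts.toDebugString())
        counts.saveAsTextFile(sys.argv[2])
        spark.stop()

    # Submit in YARN mode (example configuration values):
    #   spark-submit --master yarn --num-executors 4 \
    #       --executor-memory 2g wordcount_app.py \
    #       hdfs:///user/training/input hdfs:///user/training/output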
Phase 3: Hadoop with Spark Dataframes and Spark SQL (Days 3 and 4)
Spark SQL
- Dataframes in Depth
- Creating Dataframes
- Discussion on different file formats: ORC, Avro, Parquet, and Sequence
- Dataframe internals that make it fast – the Catalyst Optimizer and Tungsten
- Load data into Spark from external data sources like relational databases
- Saving dataframes to external sources like HDFS and RDBMS
- SQL features of dataframes
- Data formats – text formats such as CSV, JSON, and XML; binary formats such as Parquet and ORC
- UDFs in Spark Dataframes
- When to use Hive UDFs (and when not to)
- CDC use cases
- Spark optimization techniques: joins
- Integration with Teradata: use case
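A condensed, illustrative sketch of the dataframe work in this phase: reading different formats, registering a UDF, and loading from an external relational source over JDBC (the URL, tables, and the UDF itself are placeholders).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

    # Creating dataframes from different file formats
    df_json = spark.read.json("hdfs:///user/training/people.json")
    df_csv = spark.read.option("header", "true").csv("hdfs:///user/training/people.csv")
    df_parquet = spark.read.parquet("hdfs:///user/training/people.parquet")

    # A simple UDF (plain Python UDFs bypass Catalyst, so prefer built-ins)
    shout = udf(lambda s: s.upper() if s else None, StringType())
    df_json.select(shout(df_json["name"]).alias("name_upper")).show()

    # Loading from an external relational database over JDBC (illustrative URL)
    df_db = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost/retail_db")
             .option("dbtable", "orders")
             .option("user", "student").option("password", "secret")
             .load())

    # Saving a dataframe back out, e.g. as Parquet on HDFS
    df_db.write.mode("overwrite").parquet("hdfs:///user/training/orders_parquet")

    spark.stop()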
Understanding of Hive
- Hive as a Data Warehouse
- Creating Tables for Analysis of data
- Techniques of Loading Data into Tables
- Difference between Internal and External Tables
- Understanding Hive Data Types
- Joining and Union of datasets
- Join Optimizations
- Partitions and Bucketing
- Running a Spark SQL Application
- Dataframes on a JSON file
- Dataframes on Hive tables
- Querying operations on dataframes
- Writing HiveQL queries for data retrieval
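To tie Hive and Spark SQL together, the sketch below enables Hive support and runs HiveQL from Spark; the database location, table, and column names are illustrative.

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark read and write Hive metastore tables
    spark = (SparkSession.builder.appName("HiveIntegration")
             .enableHiveSupport().getOrCreate())

    # An external, partitioned table defined in HiveQL (illustrative schema)
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (
            item STRING, amount DOUBLE
        )
        PARTITIONED BY (sale_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///user/training/warehouse/sales'
    """)

    # Querying a Hive table directly as a dataframe
    top_items = spark.sql("""
        SELECT item, SUM(amount) AS total
        FROM sales
        GROUP BY item
        ORDER BY total DESC
        LIMIT 10
    """)
    top_items.show()

    spark.stop()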
Phase 4: Kafka, Spark Streaming and Cluster Walkthrough (Day 5)
Get to know Kafka and Spark Streaming
Introduction to Kafka
- Kafka Overview
- Salient Features of Kafka
- Topics, Brokers and Partitions
- Kafka Use cases
Kafka Connect and Spark Streaming
- Kafka Connect
- Hands-on Exercise
Structured Streaming
- Structured Streaming Overview
- How is it better than Kafka Streams?
- Hands-on Exercises: integrating with Kafka using Spark Structured Streaming
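The Kafka hands-on culminates in something like the following Structured Streaming sketch, which reads a Kafka topic as an unbounded dataframe; the broker address and topic name are placeholders, and the spark-sql-kafka connector package must be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("KafkaStreaming").getOrCreate()

    # Read the 'events' topic as a streaming dataframe (illustrative broker/topic)
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load())

    # Kafka values arrive as bytes; cast to string and count by key
    events = stream.select(col("key").cast("string"),
                           col("value").cast("string"))
    counts = events.groupBy("key").count()

    # Write the running counts to the console, micro-batch by micro-batch
    query = (counts.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()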