Hadoop with Spark

Course Overview:

The Hadoop Development, Administration and BI Program is a one-stop course that introduces you to Hadoop development and gives you the technical know-how to work with it. Hadoop is one of the most popular Big Data processing frameworks. Spark is a general-purpose distributed data processing engine suited to a wide range of workloads. On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.

Course Objectives:

  • Learn the basics of Big Data and Spark
  • Get hands-on with Hadoop and the Hadoop ecosystem
  • Become a top-notch Spark Developer

Pre-requisites:

  • Professionals with basic knowledge of software development, programming languages, and databases will find this course helpful. Basic knowledge is enough to succeed in this course.
  • Not for: students who are absolute beginners at software development as a discipline, who will find the course difficult to follow.

Target Audience:

  • Developers
  • Data Analysts

Course Duration:

  • 35 hours – 5 days

Course Content:

Phase 1: Hadoop Fundamentals with single node setup (Day 1)

Laying the foundation

Introduction to Hadoop and Spark

  • Ecosystem
  • Big Data Overview
  • Key Roles in Big Data Project
  • Key Business Use cases
  • Hadoop and Spark Logical Architecture
  • Typical Big Data Project Pipeline

Basic Concepts of HDFS

  • HDFS Overview
  • Physical Architectures of HDFS
  • Hands-on with the Hadoop Distributed File System

Hadoop Ecosystem

  • Sqoop
  • Hive

Introduction to Sqoop

  • What is Sqoop
  • Import / Import All Tables
  • Sqoop Job, Eval, and Codegen
  • List Databases / List Tables

Hadoop Hands On

  • Running HDFS commands
  • Running Sqoop Import and Sqoop Export

Introduction to Spark

  • Spark Overview
  • Detailed discussion on “Why Spark”
  • Quick Recap of MapReduce
  • Spark vs MapReduce
  • Why Python for Spark
  • Just Enough Python for Spark
  • Understanding of CDH Spark and Apache Spark

Phase 2: Hadoop Development (Day 2)

Become a pro developer with Spark and the Hive data warehouse

Spark Core Framework and API

  • High level Spark Architecture
  • Role of Executor, Driver, SparkSession etc.
  • Resilient Distributed Datasets
  • Basic operations in the Spark Core API, i.e. Actions and Transformations
  • Using the Spark REPL for performing interactive data analysis
  • Hands-on Exercises
  • Integrating with Hive

Delving Deeper into Spark API

  • Pair RDDs
  • Implementing Map Reduce Algorithms using Spark
  • Ways to create Pair RDDs
  • JSON Processing
  • Code Example on JSON Processing
  • XML Processing
  • Joins
  • Playing with Regular Expressions
  • Log File Processing using Regular Expressions
  • Hands-on Exercises
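
To give a flavor of the pair-RDD and log-processing topics above, here is a minimal sketch in plain Python (no Spark installation required) of the map-then-reduce-by-key pattern applied to log lines parsed with a regular expression. The log format, the sample lines, and the helper names `parse_status` and `reduce_by_key` are assumptions made for illustration and are not taken from the course material. In PySpark, the same idea is typically expressed with the RDD `map` and `reduceByKey` operations.

```python
import re
from collections import defaultdict
from functools import reduce

# Hypothetical access-log format: <ip> "<method> <path>" <status>
LOG_PATTERN = re.compile(r'^(\S+) "(\w+) (\S+)" (\d{3})$')

# Made-up sample data, including one line that does not match the pattern.
SAMPLE_LINES = [
    '10.0.0.1 "GET /index.html" 200',
    '10.0.0.2 "GET /missing" 404',
    '10.0.0.1 "POST /login" 200',
    'malformed line',
]


def parse_status(line):
    """Map step: turn a log line into a (status, 1) pair, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return (match.group(4), 1)


def reduce_by_key(pairs, func):
    """Reduce step: combine all values sharing a key, in the spirit of reduceByKey."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Build (key, value) pairs, dropping lines the regular expression rejected.
pairs = [p for p in map(parse_status, SAMPLE_LINES) if p is not None]

# Sum the 1s per status code to get a count per key.
status_counts = reduce_by_key(pairs, lambda a, b: a + b)
print(status_counts)
```

The sketch keeps the two phases separate on purpose: the map step only produces key/value pairs, and the reduce step only combines values per key, which is the shape that pair-RDD operations expect.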

Executing a Spark Application

  • Writing Standalone Spark Application
  • Various commands to execute and configure Spark Applications in various modes
  • Discussion on Application, Job, Stage, Executor, Tasks
  • Interpreting RDD Metadata/Lineage/DAG
  • Controlling degree of Parallelism in Spark Job
  • Physical execution of a Spark application
  • Discussion on: How is Spark better than MapReduce?
  • Hands-on Exercises

Phase 3: Hadoop with Spark Dataframes and Spark SQL (Day 3 and 4)

Spark SQL

  • Dataframes in Depth
  • Creating Dataframes
  • Discussion on different file formats: ORC, Sequence, Avro, and Parquet
  • Dataframe internals that make it fast – Catalyst Optimizer and Tungsten
  • Load data into Spark from external data sources like relational databases
  • Saving Dataframes to external sources like HDFS and RDBMS
  • SQL features of Dataframes
  • Data formats – text formats such as CSV, JSON, and XML; binary formats such as Parquet and ORC
  • UDFs in Spark Dataframes
  • When to use Hive UDFs and when not to
  • CDC use cases
  • Spark optimization techniques: joins
  • Integration with Teradata: use case

Understanding of Hive

  • Hive as a Data Warehouse
  • Creating Tables for Analysis of data
  • Techniques of Loading Data into Tables
  • Difference between Internal and External Tables
  • Understanding Hive Data Types
  • Joining and Union of datasets
  • Join Optimizations
  • Partitions and Bucketing
  • Running a Spark SQL Application
  • Dataframes on a JSON file
  • Dataframes on Hive tables
  • Dataframes on JSON
  • Querying operations on Dataframes
  • Writing HiveQL queries for data retrieval

Phase 4: NoSQL and Cluster Walkthrough (Day 5)

Get to know Kafka and Spark Streaming

Introduction to Kafka

  • Kafka Overview
  • Salient Features of Kafka
  • Topics, Brokers and Partitions
  • Kafka Use cases

Kafka Connect and Spark Streaming

  • Kafka Connect
  • Hands-on Exercise

Structured Streaming

  • Structured Streaming Overview
  • How is it better than Kafka streaming?
  • Hands-on Exercises integrating with Kafka using Spark Streaming

Course Customization Options

To request customized training for this course, please contact us.
