Hadoop with Spark

Course Overview:

The Hadoop Development, Administration and BI Program is a one-stop course that introduces you to Hadoop development and gives you the technical know-how to work with Hadoop, one of the most popular Big Data processing frameworks. Spark is a general-purpose distributed data processing engine suited to a wide range of workloads. On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in one application.

Course Objectives:

  • Learn the basics of Big Data and Spark
  • Play with Hadoop and Hadoop Ecosystem
  • Become a top-notch Spark Developer

Pre-requisites:

  • Professionals with basic knowledge of software development, programming languages, and databases will find this course helpful. That basic knowledge should be enough to succeed in the course.
  • Not For: Students who are absolute beginners at software development as a discipline will find it difficult to follow the course

Target Audience:

  • Developers
  • Data Analysts

Course Duration:

  • 35 hours – 5 days

Course Content:

Phase 1: Hadoop Fundamentals with single node setup (Day 1)

Laying the foundation

Introduction to Hadoop and Spark

  • Ecosystem
  • Big Data Overview
  • Key Roles in Big Data Project
  • Key Business Use cases
  • Hadoop and Spark Logical Architecture
  • Typical Big Data Project Pipeline

Basic Concepts of HDFS

  • HDFS Overview
  • Physical Architectures of HDFS
  • Hadoop Distributed File System Hands-on

Hadoop Ecosystem

  • Sqoop
  • Hive

Introduction to Sqoop

  • What is Sqoop
  • Import / Import-all-tables
  • Sqoop Job/Eval and Sqoop Code-gen
  • List databases/tables

Hadoop Hands On

  • Running HDFS commands
  • Running Sqoop Import and Sqoop Export

Introduction to Spark

  • Spark Overview
  • Detailed discussion on “Why Spark”
  • Quick Recap of MapReduce
  • Spark vs MapReduce
  • Why Python for Spark
  • Just Enough Python for Spark
  • Understanding of CDH Spark and Apache Spark
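
As a small taste of the "Quick Recap of MapReduce" and "Just Enough Python for Spark" topics above, here is a minimal sketch of a word count written as map, shuffle, and reduce steps. It uses plain Python only, not the Hadoop or Spark APIs, and the sample lines are made up for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: in a real job these lines would come from files in HDFS.
lines = ["spark is fast", "hadoop is reliable", "spark is general purpose"]

# Map phase: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key (the word).
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce phase: sum the values collected for each key.
word_counts = {
    word: reduce(lambda a, b: a + b, counts)
    for word, counts in grouped.items()
}

print(word_counts["spark"])  # 2
print(word_counts["is"])     # 3
```

The same three steps are what a distributed framework spreads across many machines; the course covers how Spark expresses them far more concisely.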

Phase 2: Hadoop Development (Day 2)

Become a Pro developer with Spark and Hive Datawarehouse

Spark Core Framework and API

  • High level Spark Architecture
  • Role of Executor, Driver, SparkSession etc.
  • Resilient Distributed Datasets
  • Basic operations in Spark Core API, i.e. Actions and Transformations
  • Using the Spark REPL for performing interactive data analysis
  • Hands-on Exercises
  • Integrating with Hive

Delving Deeper into Spark API

  • Pair RDDs
  • Implementing Map Reduce Algorithms using Spark
  • Ways to create Pair RDDs
  • JSON Processing
  • Code Example on JSON Processing
  • XML Processing
  • Joins
  • Playing with Regular Expressions
  • Log File Processing using Regular Expressions
  • Hands-on Exercises
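
To illustrate the kind of exercise covered under "Log File Processing using Regular Expressions", here is a minimal sketch in plain Python (not the Spark API). The log lines, the pattern, and the helper name `parse` are assumptions made for this example; the course's own datasets may use a different log format.

```python
import re
from collections import Counter

# Hypothetical sample lines in a common web-server access log layout.
log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /missing HTTP/1.1" 404 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:02 +0000] "POST /login HTTP/1.1" 200 256',
    'malformed line',
]

# Capture groups: client IP, timestamp, HTTP method, path, status code, size.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)$'
)

def parse(line):
    """Return an (ip, status) key-value pair, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, _timestamp, _method, _path, status, _size = match.groups()
    return (ip, int(status))

# Keep only the lines that parsed, as (key, value) pairs.
pairs = [pair for pair in map(parse, log_lines) if pair is not None]

# Aggregate by key: requests per client IP and responses per status code.
requests_per_ip = Counter(ip for ip, _ in pairs)
status_counts = Counter(status for _, status in pairs)

print(requests_per_ip["10.0.0.1"])  # 2
print(status_counts[404])           # 1
```

Parsing each line into a key-value pair and then aggregating by key follows the same shape as the Pair RDD exercises, where the parsing and counting run in parallel across a cluster.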

Executing a Spark Application

  • Writing Standalone Spark Application
  • Various commands to execute and configure Spark Applications in various modes
  • Discussion on Application, Job, Stage, Executor, Tasks
  • Interpreting RDD Metadata/Lineage/DAG
  • Controlling degree of Parallelism in Spark Job
  • Physical execution of a Spark application
  • Discussion on: How is Spark better than MapReduce?
  • Hands-on Exercises

Phase 3: Hadoop with Spark Dataframes and Spark SQL (Day 3 and 4)

Spark SQL

  • Dataframes in Depth
  • Creating Dataframes
  • Discussion on different file formats: ORC, Sequence, Avro, and Parquet
  • Dataframe internals that make it fast – Catalyst Optimizer and Tungsten
  • Load data into Spark from external data sources like relational databases
  • Saving dataframe to external sources like HDFS, RDBMS
  • SQL features of Data frame
  • Data formats – text formats such as CSV, JSON, and XML; binary formats such as Parquet and ORC
  • UDF in Spark Dataframe
  • When to use Hive UDFs and when not to
  • CDC use cases
  • Spark optimization techniques: joins
  • Integration with Teradata: use case

Understanding of Hive

  • Hive as a Data Warehouse
  • Creating Tables for Analysis of data
  • Techniques of Loading Data into Tables
  • Difference between Internal and External Tables
  • Understanding Hive Data Types
  • Joining and Union of datasets
  • Join Optimizations
  • Partitions and Bucketing
  • Running a Spark SQL Application
  • Dataframes on a JSON file
  • Dataframes on hive tables
  • Dataframes on JSON
  • Querying operations on Dataframes
  • Writing Hive (HQL) queries for data retrieval

Phase 4: NoSQL and Cluster Walkthrough (Day 5)

Know Kafka Tool and Spark Streaming

Introduction to Kafka

  • Kafka Overview
  • Salient Features of Kafka
  • Topics, Brokers and Partitions
  • Kafka Use cases

Kafka Connect and Spark Streaming

  • Kafka Connect
  • Hands-on Exercise

Structured Streaming

  • Structured Streaming Overview
  • How is it better than Kafka streaming?
  • Hands-on Exercises integrating with Kafka using Spark Streaming

Course Customization Options

To request a customized training for this course, please contact us to arrange.
