Course Overview:
The Hadoop Development, Administration and BI Program is a one-stop course that introduces you to the domain of Hadoop development and gives you hands-on technical know-how of Hadoop, the most popular Big Data processing framework. Spark is a general-purpose distributed data processing engine suitable for use in a wide range of circumstances. On top of the Spark core data processing engine sit libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.
Course Objectives:
- Learn the basics of Big Data and Spark
- Play with Hadoop and Hadoop Ecosystem
- Become a top-notch Spark Developer
Pre-requisites:
- Professionals with basic knowledge of software development, programming languages, and databases will find this course helpful; that basic knowledge is enough to succeed in the course
- Not for: absolute beginners in software development, who will find it difficult to follow the course
Target Audience:
- Developers
- Data Analysts
Course Duration:
- 35 hours – 5 days
Course Content:
Phase 1: Hadoop Fundamentals with a single-node setup (Day 1)
Laying the foundation
Introduction to Hadoop and Spark
- Ecosystem
- Big Data Overview
- Key Roles in Big Data Project
- Key Business Use cases
- Hadoop and Spark Logical Architecture
- Typical Big Data Project Pipeline
Basic Concepts of HDFS
- HDFS Overview
- Physical Architectures of HDFS
- Hands-on with the Hadoop Distributed File System
Hadoop Ecosystem
- Sqoop
- Hive
Introduction to Sqoop
- What is Sqoop?
- Importing data: import / import-all-tables
- Sqoop Job/Eval and Sqoop Code-gen
- List databases/tables
Hadoop Hands On
- Running HDFS commands
- Running Sqoop Import and Sqoop Export
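As a flavour of this hands-on work, here is a minimal sketch (in Python, the course's language of choice) that drives the hdfs and sqoop command-line tools; the JDBC URL, credentials, tables, and paths are illustrative placeholders only.

    import subprocess

    def run(cmd):
        """Run a shell command and raise if it fails."""
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Basic HDFS commands: make a directory, upload a file, list it
    run(["hdfs", "dfs", "-mkdir", "-p", "/user/training/input"])
    run(["hdfs", "dfs", "-put", "localfile.txt", "/user/training/input/"])
    run(["hdfs", "dfs", "-ls", "/user/training/input"])

    # Sqoop import: copy a relational table into HDFS
    # (connection details below are placeholders)
    run(["sqoop", "import",
         "--connect", "jdbc:mysql://dbhost/retail_db",
         "--username", "student", "--password", "secret",
         "--table", "orders",
         "--target-dir", "/user/training/orders"])

    # Sqoop export: push HDFS results back to the database
    run(["sqoop", "export",
         "--connect", "jdbc:mysql://dbhost/retail_db",
         "--username", "student", "--password", "secret",
         "--table", "order_totals",
         "--export-dir", "/user/training/order_totals"])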
Introduction to Spark
- Spark Overview
- Detailed discussion on “Why Spark”
- Quick Recap of MapReduce
- Spark vs MapReduce
- Why Python for Spark
- Just Enough Python for Spark
- Understanding of CDH Spark and Apache Spark
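To make the overview concrete, below is a minimal PySpark sketch of the kind written throughout the course: it starts a SparkSession and runs a first transformation and action on a small dataset (names and values are illustrative).

    from pyspark.sql import SparkSession

    # Entry point for a Spark application
    spark = SparkSession.builder.appName("FirstSparkApp").getOrCreate()

    # Distribute a small Python list as an RDD
    numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

    # Transformation (lazy) followed by an action (triggers execution)
    squares = numbers.map(lambda x: x * x)
    print(squares.collect())  # [1, 4, 9, 16, 25]

    spark.stop()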
Phase 2: Hadoop Development (Day 2)
Become a pro developer with Spark and the Hive data warehouse
Spark Core Framework and API
- High-level Spark Architecture
- Role of the Executor, Driver, SparkSession, etc.
- Resilient Distributed Datasets
- Basic operations in Spark Core API, i.e. Actions and Transformations
- Using the Spark REPL for performing interactive data analysis
- Hands-on Exercises
- Integrating with Hive
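A small illustrative sketch of core-API usage as practised in the REPL exercises: transformations build a lineage lazily, and nothing runs until an action is called (the file path is a placeholder).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CoreAPI").getOrCreate()
    sc = spark.sparkContext

    # Transformations are lazy: they only describe the computation
    lines = sc.textFile("hdfs:///user/training/input/localfile.txt")
    long_lines = lines.filter(lambda line: len(line) > 20)
    upper = long_lines.map(lambda line: line.upper())

    # Actions trigger the actual distributed execution
    print(upper.count())  # number of long lines
    print(upper.take(3))  # first three results

    spark.stop()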
Delving Deeper into Spark API
- Pair RDDs
- Implementing MapReduce Algorithms using Spark
- Ways to create Pair RDDs
- JSON Processing: code example
- XML Processing
- Joins
- Playing with Regular Expressions
- Log File Processing using Regular Expressions
- Hands-on Exercises
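For example, the classic MapReduce word count becomes a few lines with pair RDDs; the log-file pattern shown is a deliberately simplified, illustrative regular expression, and the paths are placeholders.

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PairRDDs").getOrCreate()
    sc = spark.sparkContext

    # Word count: the canonical MapReduce algorithm with pair RDDs
    words = sc.textFile("hdfs:///user/training/input/book.txt") \
              .flatMap(lambda line: line.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.take(5))

    # Log processing: extract (ip, 1) pairs with a simplified regex
    log_pattern = re.compile(r"^(\d+\.\d+\.\d+\.\d+)")
    ips = sc.textFile("hdfs:///user/training/logs/access.log") \
            .flatMap(lambda line: log_pattern.findall(line))
    hits_per_ip = ips.map(lambda ip: (ip, 1)).reduceByKey(lambda a, b: a + b)
    print(hits_per_ip.take(5))

    spark.stop()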
Executing a Spark Application
- Writing Standalone Spark Application
- Various commands to execute and configure Spark Applications in various modes
- Discussion on Application, Job, Stage, Executor, Tasks
- Interpreting RDD Metadata/Lineage/DAG
- Controlling degree of Parallelism in Spark Job
- Physical execution of a Spark application
- Discussion on: How is Spark better than MapReduce?
- Hands-on Exercises
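A sketch of what a standalone application looks like, together with an illustrative spark-submit invocation; the app name, paths, executor settings, and partition count are example values chosen for the discussion of parallelism and lineage above.

    # wordcount_app.py -- a standalone Spark application
    import sys
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("WordCountApp").getOrCreate()
        sc = spark.sparkContext

        # Explicitly request 8 partitions to control parallelism
        data = sc.textFile(sys.argv[1], minPartitions=8)
        counts = (data.flatMap(lambda line: line.split())
                      .map(lambda w: (w, 1))
                      .reduceByKey(lambda a, b: a + b))

        # toDebugString shows the RDD lineage / DAG discussed in class
        print(counts.toDebugString())
        counts.saveAsTextFile(sys.argv[2])
        spark.stop()

    # Submit in YARN mode (example configuration values):
    #   spark-submit --master yarn --num-executors 4 \
    #       --executor-memory 2g wordcount_app.py \
    #       hdfs:///user/training/input hdfs:///user/training/output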
Phase 3: Hadoop with Spark Dataframes and Spark SQL (Days 3 and 4)
Spark SQL
- Dataframes in Depth
- Creating Dataframes
- Discussion on different file formats: ORC, Avro, Parquet, and Sequence
- Dataframe internals that make it fast – the Catalyst Optimizer and Tungsten
- Load data into Spark from external data sources like relational databases
- Saving dataframes to external sources like HDFS and RDBMS
- SQL features of dataframes
- Data formats – text formats such as CSV, JSON, and XML; binary formats such as Parquet and ORC
- UDFs in Spark Dataframes
- When to use Hive UDFs (and when not to)
- CDC use cases
- Spark optimization techniques: joins
- Integration with Teradata: use case
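A condensed, illustrative sketch of the dataframe work in this phase: reading different formats, registering a UDF, and loading from an external relational source over JDBC (the URL, tables, and the UDF itself are placeholders).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

    # Creating dataframes from different file formats
    df_json = spark.read.json("hdfs:///user/training/people.json")
    df_csv = spark.read.option("header", "true").csv("hdfs:///user/training/people.csv")
    df_parquet = spark.read.parquet("hdfs:///user/training/people.parquet")

    # A simple UDF (plain Python UDFs bypass Catalyst, so prefer built-ins)
    shout = udf(lambda s: s.upper() if s else None, StringType())
    df_json.select(shout(df_json["name"]).alias("name_upper")).show()

    # Loading from an external relational database over JDBC (illustrative URL)
    df_db = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost/retail_db")
             .option("dbtable", "orders")
             .option("user", "student").option("password", "secret")
             .load())

    # Saving a dataframe back out, e.g. as Parquet on HDFS
    df_db.write.mode("overwrite").parquet("hdfs:///user/training/orders_parquet")

    spark.stop()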
Understanding of Hive
- Hive as a Data Warehouse
- Creating Tables for Analysis of data
- Techniques of Loading Data into Tables
- Difference between Internal and External Tables
- Understanding Hive Data Types
- Joining and Union of datasets
- Join Optimizations
- Partitions and Bucketing
- Running a Spark SQL Application
- Dataframes on a JSON file
- Dataframes on Hive tables
- Querying operations on dataframes
- Writing HiveQL queries for data retrieval
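To tie Hive and Spark SQL together, the sketch below enables Hive support and runs HiveQL from Spark; the database location, table, and column names are illustrative.

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark read and write Hive metastore tables
    spark = (SparkSession.builder.appName("HiveIntegration")
             .enableHiveSupport().getOrCreate())

    # An external, partitioned table defined in HiveQL (illustrative schema)
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (
            item STRING, amount DOUBLE
        )
        PARTITIONED BY (sale_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///user/training/warehouse/sales'
    """)

    # Querying a Hive table directly as a dataframe
    top_items = spark.sql("""
        SELECT item, SUM(amount) AS total
        FROM sales
        GROUP BY item
        ORDER BY total DESC
        LIMIT 10
    """)
    top_items.show()

    spark.stop()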
Phase 4: Kafka, Spark Streaming and Cluster Walkthrough (Day 5)
Get to know Kafka and Spark Streaming
Introduction to Kafka
- Kafka Overview
- Salient Features of Kafka
- Topics, Brokers and Partitions
- Kafka Use cases
Kafka Connect and Spark Streaming
- Kafka Connect
- Hands-on Exercise
Structured Streaming
- Structured Streaming Overview
- How is it better than Kafka Streams?
- Hands-on Exercises: integrating with Kafka using Spark Structured Streaming
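The Kafka hands-on culminates in something like the following Structured Streaming sketch, which reads a Kafka topic as an unbounded dataframe; the broker address and topic name are placeholders, and the spark-sql-kafka connector package must be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("KafkaStreaming").getOrCreate()

    # Read the 'events' topic as a streaming dataframe (illustrative broker/topic)
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load())

    # Kafka values arrive as bytes; cast to string and count by key
    events = stream.select(col("key").cast("string"),
                           col("value").cast("string"))
    counts = events.groupBy("key").count()

    # Write the running counts to the console, micro-batch by micro-batch
    query = (counts.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination()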