Syllabus of PySpark Certification Course in Bangalore
Module 1: Introduction to Big Data Hadoop and Spark
- 1. What is Big Data?
- 2. Big Data Customer Scenarios
- 3. Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
- 4. How Does Hadoop Solve the Big Data Problem?
- 5. What is Hadoop?
- 6. Hadoop’s Key Characteristics
- 7. Hadoop Ecosystem and HDFS
- 8. Hadoop Core Components
- 9. Rack Awareness and Block Replication
- 10. YARN and Its Advantages
- 11. Hadoop Cluster and its Architecture
- 12. Hadoop: Different Cluster Modes
- 13. Big Data Analytics with Batch & Real-Time Processing
- 14. Why Is Spark Needed?
- 15. What is Spark?
- 16. How Does Spark Differ from Its Competitors?
- 17. Spark at eBay
- 18. Spark’s Place in Hadoop Ecosystem
Module 2: Introduction to Python for Apache Spark
- 1. Overview of Python
- 2. Different Applications where Python is Used
- 3. Values, Types, Variables
- 4. Operands and Expressions
- 5. Conditional Statements
- 6. Loops
- 7. Command Line Arguments
- 8. Writing to the Screen
- 9. Python files I/O Functions
- 10. Numbers
- 11. Strings and related operations
- 12. Tuples and related operations
- 13. Lists and related operations
- 14. Dictionaries and related operations
- 15. Sets and related operations
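The core Python data types covered in this module can be sketched in a few lines of plain Python (the values used here are purely illustrative):

```python
# Values, types, and variables
count = 3                      # int
name = "spark"                 # str

# Strings and related operations
shout = name.upper()           # uppercase copy of the string

# Tuples: immutable sequences
point = (1, 2)

# Lists: mutable sequences
nums = [4, 1, 3]
nums.sort()                    # sorts in place

# Dictionaries: key-to-value mappings
ages = {"ann": 30, "bob": 25}
ages["cid"] = 40               # add a new entry

# Sets: unordered collections of unique elements
letters = set("banana")        # duplicates collapse
```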
Module 3: Functions, OOPs, and Modules in Python
- 1. Functions
- 2. Function Parameters
- 3. Global Variables
- 4. Variable Scope and Returning Values
- 5. Lambda Functions
- 6. Object-Oriented Concepts
- 7. Standard Libraries
- 8. Modules Used in Python
- 9. The Import Statements
- 10. Module Search Path
- 11. Installing Packages
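The ideas in this module — functions with parameters, global variables, variable scope, lambda functions, and importing a standard-library module — fit in a short sketch (names and values are illustrative):

```python
import math  # standard-library module found on the module search path

GREETING = "hi"  # global variable

def area(radius, scale=1.0):
    """Function with a default parameter; returns a value."""
    return math.pi * (radius * scale) ** 2

# Lambda functions: small anonymous functions
double = lambda x: 2 * x

def describe():
    # Reads the global; reassigning it here would require `global GREETING`
    return GREETING + " there"
```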
Module 4: Deep Dive into Apache Spark Framework
- 1. Spark Components & its Architecture
- 2. Spark Deployment Modes
- 3. Introduction to PySpark Shell
- 4. Submitting PySpark Job
- 5. Spark Web UI
- 6. Writing your first PySpark Job Using Jupyter Notebook
- 7. Data Ingestion using Sqoop
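The PySpark shell and job-submission workflow above can be sketched with two commands (the master URLs and script name `my_job.py` are illustrative):

```shell
# Launch the interactive PySpark shell in local mode with two cores
pyspark --master local[2]

# Submit a standalone PySpark script to a YARN cluster
spark-submit --master yarn --deploy-mode cluster my_job.py
```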
Module 5: Playing with Spark RDDs
- 1. Challenges in Existing Computing Methods
- 2. Probable Solution & How RDD Solves the Problem
- 3. What is an RDD? Its Operations: Transformations & Actions
- 4. Data Loading and Saving Through RDDs
- 5. Key-Value Pair RDDs
- 6. Other Pair RDD Operations; Working with Two Pair RDDs
- 7. RDD Lineage
- 8. RDD Persistence
- 9. WordCount Program Using RDD Concepts
- 10. RDD Partitioning & How it Helps Achieve Parallelization
- 11. Passing Functions to Spark
Module 6: DataFrames and Spark SQL
- 1. Need for Spark SQL
- 2. What is Spark SQL
- 3. Spark SQL Architecture
- 4. SQL Context in Spark SQL
- 5. Schema RDDs
- 6. User Defined Functions
- 7. DataFrames & Datasets
- 8. Interoperating with RDDs
- 9. JSON and Parquet File Formats
- 10. Loading Data through Different Sources
- 11. Spark-Hive Integration
Module 7: Machine Learning using Spark MLlib
- 1. Why Machine Learning
- 2. What is Machine Learning
- 3. Where Machine Learning Is Used
- 4. Different Types of Machine Learning Techniques
- 5. Introduction to MLlib
- 6. Features of MLlib and MLlib Tools
- 7. Various ML algorithms supported by MLlib
Module 8: Deep Dive into Spark MLlib
- 1. Supervised Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest
- 2. Unsupervised Learning: K-Means Clustering & How It Works with MLlib
- 3. Analysis of US Election Data using MLlib (K-Means)
Module 9: Understanding Apache Kafka and Apache Flume
- 1. Need for Kafka
- 2. What is Kafka
- 3. Core Concepts of Kafka
- 4. Kafka Architecture
- 5. Where is Kafka Used
- 6. Understanding the Components of Kafka Cluster
- 7. Configuring Kafka Cluster
- 8. Kafka Producer and Consumer Java API
- 9. Need for Apache Flume
- 10. What is Apache Flume
- 11. Basic Flume Architecture
- 12. Flume Sources
- 13. Flume Sinks
- 14. Flume Channels
- 15. Flume Configuration
- 16. Integrating Apache Flume and Apache Kafka
Module 10: Apache Spark Streaming - Processing Multiple Batches
- 1. Drawbacks in Existing Computing Methods
- 2. Why Streaming is Necessary
- 3. What is Spark Streaming
- 4. Spark Streaming Features
- 5. Spark Streaming Workflow
- 6. How Uber Uses Streaming Data
- 7. Streaming Context & DStreams
- 8. Transformations on DStreams
- 9. Windowed Operators and Why They Are Useful
- 10. Important Windowed Operators
- 11. Slice, Window and ReduceByWindow Operators
- 12. Stateful Operators
Module 11: Apache Spark Streaming - Data Sources
- 1. Streaming Data Source Overview
- 2. Apache Flume and Apache Kafka Data Sources
- 3. Example: Using a Kafka Direct Data Source
Module 12: Spark GraphX (Self-Paced)
- 1. Introduction to Spark GraphX
- 2. Information about a Graph
- 3. GraphX Basic APIs and Operations
- 4. Spark GraphX Algorithms: PageRank, Personalized PageRank, Triangle Count, Shortest Paths, Connected Components, Strongly Connected Components, Label Propagation