Detailed Course Outline
HDFS Introduction
- HDFS Overview
- HDFS Components and Interactions
- Additional HDFS Interactions
- Ozone Overview
- Exercise: Working with HDFS
YARN Introduction
- YARN Overview
- YARN Components and Interactions
- Working with YARN
- Exercise: Working with YARN
Working with RDDs
- Resilient Distributed Datasets (RDDs)
- Exercise: Working with RDDs
Working with DataFrames
- Introduction to DataFrames
- Exercise: Introducing DataFrames
- Exercise: Reading and Writing DataFrames
- Exercise: Working with Columns
- Exercise: Working with Complex Types
- Exercise: Combining and Splitting DataFrames
- Exercise: Summarizing and Grouping DataFrames
- Exercise: Working with UDFs
- Exercise: Working with Windows
Introduction to Apache Hive
- About Hive
- Transforming Data with HiveQL
Working with Apache Hive
- Exercise: Working with Partitions
- Exercise: Working with Buckets
- Exercise: Working with Skew
- Exercise: Using SerDes to Ingest Text Data
- Exercise: Using Complex Types to Denormalize Data
Hive and Spark Integration
- Hive and Spark Integration
- Exercise: Spark Integration with Hive
Distributed Processing Challenges
- Shuffle
- Skew
- Order
Spark Distributed Processing
- Spark Distributed Processing
- Exercise: Explore Query Execution Order
Spark Distributed Persistence
- DataFrame and Dataset Persistence
- Persistence Storage Levels
- Viewing Persisted RDDs
- Exercise: Persisting DataFrames
Data Engineering Service
- Create and Trigger Ad-Hoc Spark Jobs
- Orchestrate a Set of Jobs Using Airflow
- Data Lineage Using Atlas
- Auto-scaling in Data Engineering Service
Workload XM
- Optimize Workloads, Performance, Capacity
- Identify Suboptimal Spark Jobs
Appendix: Working with Datasets in Scala
- Working with Datasets in Scala
- Exercise: Using Datasets in Scala