Detailed Course Outline
Introduction
Apache Hadoop Fundamentals
- The Motivation for Hadoop
- Hadoop Overview
- Data Storage: HDFS
- Distributed Data Processing: YARN, MapReduce, and Spark
- Data Processing and Analysis: Pig, Hive, and Impala
- Database Integration: Sqoop
- Other Hadoop Data Tools
- Exercise Scenarios
Introduction to Apache Pig
- What is Pig?
- Pig’s Features
- Pig Use Cases
- Interacting with Pig
Basic Data Analysis with Apache Pig
- Pig Latin Syntax
- Loading Data
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly Used Functions
Processing Complex Data with Apache Pig
- Storage Formats
- Complex/Nested Data Types
- Grouping
- Built-In Functions for Complex Data
- Iterating Grouped Data
Multi-Dataset Operations with Apache Pig
- Techniques for Combining Datasets
- Joining Datasets in Pig
- Set Operations
- Splitting Datasets
Apache Pig Troubleshooting and Optimization
- Troubleshooting Pig
- Logging
- Using Hadoop’s Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan
- Tips for Improving the Performance of Pig Jobs
Introduction to Apache Hive and Impala
- What is Hive?
- What is Impala?
- Why Use Hive and Impala?
- Schema and Data Storage
- Comparing Hive and Impala to Traditional Databases
- Use Cases
Querying with Apache Hive and Impala
- Databases and Tables
- Basic Hive and Impala Query Language Syntax
- Data Types
- Using Hue to Execute Queries
- Using Beeline (Hive’s Shell)
- Using the Impala Shell
Apache Hive and Impala Data Management
- Data Storage
- Creating Databases and Tables
- Loading Data
- Altering Databases and Tables
- Simplifying Queries with Views
- Storing Query Results
Data Storage and Performance
- Partitioning Tables
- Loading Data into Partitioned Tables
- When to Use Partitioning
- Choosing a File Format
- Using Avro and Parquet File Formats
Relational Data Analysis with Apache Hive and Impala
- Joining Datasets
- Common Built-In Functions
- Aggregation and Windowing
Complex Data with Apache Hive and Impala
- Complex Data with Hive
- Complex Data with Impala
Analyzing Text with Apache Hive and Impala
- Using Regular Expressions with Hive and Impala
- Processing Text Data with SerDes in Hive
- Sentiment Analysis and n-grams in Hive
Apache Hive Optimization
- Understanding Query Performance
- Bucketing
- Indexing Data
- Hive on Spark
Apache Impala Optimization
- How Impala Executes Queries
- Improving Impala Performance
Extending Apache Hive and Impala
- Custom SerDes and File Formats in Hive
- Data Transformation with Custom Scripts in Hive
- User-Defined Functions
- Parameterized Queries
Choosing the Best Tool for the Job
- Comparing Pig, Hive, Impala, and Relational Databases
- Which to Choose?