Cloudera Data Analyst Training (CDAT) – Outline

Detailed Course Outline

Introduction
Apache Hadoop Fundamentals
  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Pig, Hive, and Impala
  • Database Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenario Explanation
Introduction to Apache Hive and Impala
  • What Is Hive?
  • What Is Impala?
  • Why Use Hive and Impala?
  • Schema and Data Storage
  • Comparing Hive and Impala to Traditional Databases
  • Use Cases
Querying with Apache Hive and Impala
  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Using Hue to Execute Queries
  • Using Beeline (Hive's Shell)
  • Using the Impala Shell
Common Operators and Built-In Functions
  • Operators
  • Scalar Functions
  • Aggregate Functions
Data Management
  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results
Data Storage and Performance
  • Partitioning Tables
  • Loading Data into Partitioned Tables
  • When to Use Partitioning
  • Choosing a File Format
  • Using Avro and Parquet File Formats
Working with Multiple Datasets
  • UNION and Joins
  • Handling NULL Values in Joins
  • Advanced Joins
Analytic Functions and Windowing
  • Using Common Analytic Functions
  • Other Analytic Functions
  • Sliding Windows
Complex Data
  • Complex Data with Hive
  • Complex Data with Impala
Analyzing Text
  • Using Regular Expressions with Hive and Impala
  • Processing Text Data with SerDes in Hive
  • Sentiment Analysis and n-grams
Apache Hive Optimization
  • Understanding Query Performance
  • Bucketing
  • Hive on Spark
Apache Impala Optimization
  • How Impala Executes Queries
  • Improving Impala Performance
Extending Apache Hive and Impala
  • Custom SerDes and File Formats in Hive
  • Data Transformation with Custom Scripts in Hive
  • User-Defined Functions
  • Parameterized Queries
Choosing the Best Tool for the Job
  • Comparing Hive, Impala, and Relational Databases
  • Which to Choose?
Conclusion