Detailed Course Content
Module 1 - Introduction to Building Batch Data Pipelines
Topics:
- EL, ELT, ETL
- Quality considerations
- How to conduct operations in BigQuery
- Shortcomings of ELT
- ETL to solve data quality issues
Objectives:
- Review different methods of loading data into your data lakes and warehouses: EL, ELT, and ETL (see the code sketch below).
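The following is a minimal sketch of the EL and ELT patterns this module compares, written against the google-cloud-bigquery Python client. The project, dataset, bucket, and column names are placeholders, not part of the course materials.

```python
# EL: load raw CSV files from Cloud Storage straight into BigQuery,
# then ELT: transform inside BigQuery with SQL.
# All project/dataset/bucket names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Extract & Load: ingest the raw file as-is (the "EL" step).
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/orders.csv",             # hypothetical source file
    "my-project.staging.orders_raw",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish

# Transform: clean and reshape with SQL inside BigQuery (the "T" of ELT).
transform_sql = """
CREATE OR REPLACE TABLE `my-project.warehouse.orders` AS
SELECT order_id, CAST(order_date AS DATE) AS order_date, total_amount
FROM `my-project.staging.orders_raw`
WHERE total_amount IS NOT NULL
"""
client.query(transform_sql).result()
```

When the required cleanup goes beyond what SQL can express conveniently, the module's last topic applies: transform the data in a pipeline before loading it, i.e. ETL.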
Module 2 - Executing Spark on Dataproc
Topics:
- The Hadoop ecosystem
- Run Hadoop on Dataproc
- Cloud Storage instead of HDFS
- Optimizing Dataproc
Objectives:
- Review the Hadoop ecosystem.
- Discuss how to lift and shift your existing Hadoop workloads to the cloud using Dataproc.
- Explain when to use Cloud Storage instead of HDFS storage (illustrated in the sketch below).
- Explain how to optimize your Dataproc jobs.
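To ground the "Cloud Storage instead of HDFS" topic, here is a minimal PySpark sketch of the kind of job this module discusses lifting and shifting onto Dataproc. The bucket and cluster names are placeholders; on Dataproc the Cloud Storage connector is preinstalled, so code that previously read hdfs:// paths can read gs:// paths instead and the cluster itself stays stateless.

```python
# wordcount.py -- a minimal PySpark job; submit with, for example:
#   gcloud dataproc jobs submit pyspark wordcount.py --cluster=my-cluster --region=us-central1
# The bucket and cluster names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-wordcount").getOrCreate()

# Read directly from Cloud Storage (gs://) rather than cluster-local HDFS,
# so the cluster can be deleted between runs without losing data.
lines = spark.read.text("gs://my-bucket/input/*.txt").rdd.map(lambda r: r[0])

counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Write the results back to Cloud Storage as well.
counts.toDF(["word", "count"]).write.mode("overwrite").csv("gs://my-bucket/output/wordcount")

spark.stop()
```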
Module 3 - Serverless Data Processing with Dataflow
Topics:
- Introduction to Dataflow
- Why customers value Dataflow
- Dataflow pipelines
- Aggregate with GroupByKey and Combine
- Side inputs and windows
- Dataflow templates
Objectives:
- Identify the features that customers value in Dataflow.
- Discuss core concepts in Dataflow.
- Review the use of Dataflow templates and SQL.
- Write a simple Dataflow pipeline and run it both locally and on the cloud.
- Identify map and reduce operations, execute the pipeline, and use command-line parameters.
- Read data from BigQuery into Dataflow and use the output of a pipeline as a side input to another pipeline (see the Beam sketch below).
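A minimal Apache Beam sketch (Python SDK) of the pipeline concepts listed above: a Map step, an aggregation with CombinePerKey, and command-line pipeline options that let the same code run locally (the DirectRunner is the default) or on Dataflow by passing --runner=DataflowRunner plus project, region, and temp_location. The file paths and the sales-data layout are placeholder assumptions.

```python
# pipeline.py -- run locally:  python pipeline.py --input=... --output=...
# run on Dataflow: add --runner=DataflowRunner --project=... --region=... --temp_location=gs://...
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    """Turn 'store_id,amount' CSV lines into (store_id, amount) pairs."""
    store_id, amount = line.split(",")
    return store_id, float(amount)


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="gs://my-bucket/sales/*.csv")    # placeholder path
    parser.add_argument("--output", default="gs://my-bucket/results/sales") # placeholder path
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Remaining args (runner, project, temp_location, ...) become pipeline options.
    options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(known_args.input)
            | "Parse" >> beam.Map(parse_line)           # map step
            | "SumPerStore" >> beam.CombinePerKey(sum)  # aggregate (Combine)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write" >> beam.io.WriteToText(known_args.output)
        )


if __name__ == "__main__":
    run()
```

The module's side-input exercise follows the same shape: read the lookup data (for example with beam.io.ReadFromBigQuery) and pass that PCollection to a downstream transform via beam.pvalue.AsDict or AsSingleton so every element can consult it.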
Module 4 - Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
Topics:
- Building batch data pipelines visually with Cloud Data Fusion
  - Components
  - UI overview
  - Building a pipeline
  - Exploring data using Wrangler
- Orchestrating work between Google Cloud services with Cloud Composer
  - Apache Airflow environment
  - DAGs and operators
  - Workflow scheduling
  - Monitoring and logging
Objectives:
- Discuss how to manage your data pipelines with Cloud Data Fusion and Cloud Composer.
- Summarize how Cloud Data Fusion allows data analysts and ETL developers to wrangle data and build pipelines in a visual way.
- Describe how Cloud Composer can help orchestrate work across multiple Google Cloud services (see the DAG sketch below).
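To make the Cloud Composer topics concrete, here is a minimal Airflow 2.x DAG sketch that orchestrates two Google Cloud steps: loading a file from Cloud Storage into BigQuery, then running a transformation query. The operator import paths and arguments depend on the apache-airflow-providers-google version installed in your Composer environment, and all bucket, dataset, and table names are placeholders.

```python
# composer_example_dag.py -- a minimal Airflow DAG sketch; names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # workflow scheduling
    catchup=False,
) as dag:

    # Step 1: land the raw file in a BigQuery staging table.
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw_orders",
        bucket="my-bucket",                          # placeholder bucket
        source_objects=["raw/orders_{{ ds }}.csv"],  # Airflow templating for the run date
        destination_project_dataset_table="my-project.staging.orders_raw",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
    )

    # Step 2: transform the staging table into the warehouse table.
    transform = BigQueryInsertJobOperator(
        task_id="transform_orders",
        configuration={
            "query": {
                "query": "CREATE OR REPLACE TABLE `my-project.warehouse.orders` AS "
                         "SELECT * FROM `my-project.staging.orders_raw`",
                "useLegacySql": False,
            }
        },
    )

    # DAG dependency: load first, then transform.
    load_raw >> transform
```

Task progress, scheduling, and logs for a DAG like this are then monitored from the Airflow UI and Cloud Logging in the Composer environment, which is what the monitoring and logging topic covers.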