- 1. Introduction
- 2. Hadoop Fundamentals
- 3. Introduction to Pig
- 4. Basic Data Analysis with Pig
- 5. Processing Complex Data with Pig
- 6. Multi-Dataset Operations with Pig
- 7. Extending Pig
- 8. Pig Troubleshooting and Optimization
- 9. Introduction to Hive
- 10. Relational Data Analysis with Hive
- 11. Hive Data Management
- 12. Text Processing with Hive
- 13. Hive Optimization
- 14. Extending Hive
- 15. Introduction to Impala
- 16. Analyzing Data with Impala
- 17. Choosing the Best Tool for the Job
1. Introduction
- About this Course
- About Big Data
- Course Logistics
- Introductions
2. Hadoop Fundamentals
- The Motivation for Hadoop
- Hadoop Overview
- HDFS
- MapReduce
- The Hadoop Ecosystem
- Lab Scenario Explanation
- Hands-On Exercise: Data Ingest with Hadoop Tools
3. Introduction to Pig
- What Is Pig?
- Pig’s Features
- Pig Use Cases
- Interacting with Pig
4. Basic Data Analysis with Pig
- Pig Latin Syntax
- Loading Data
- Simple Data Types
- Field Definitions
- Data Output
- Viewing the Schema
- Filtering and Sorting Data
- Commonly Used Functions
- Hands-On Exercise: Using Pig for ETL Processing
5. Processing Complex Data with Pig
- Storage Formats
- Complex/Nested Data Types
- Grouping
- Built-in Functions for Complex Data
- Iterating Grouped Data
- Hands-On Exercise: Analyzing Ad Campaign Data with Pig
6. Multi-Dataset Operations with Pig
- Techniques for Combining Data Sets
- Joining Data Sets in Pig
- Set Operations
- Splitting Data Sets
- Hands-On Exercise: Analyzing Disparate Data Sets with Pig
7. Extending Pig
- Adding Flexibility with Parameters
- Macros and Imports
- UDFs
- Contributed Functions
- Using Other Languages to Process Data with Pig
- Hands-On Exercise: Extending Pig with Streaming and UDFs
8. Pig Troubleshooting and Optimization
- Troubleshooting Pig
- Logging
- Using Hadoop’s Web UI
- Optional Demo: Troubleshooting a Failed Job with the Web UI
- Data Sampling and Debugging
- Performance Overview
- Understanding the Execution Plan
- Tips for Improving the Performance of Your Pig Jobs
9. Introduction to Hive
- What Is Hive?
- Hive Schema and Data Storage
- Comparing Hive to Traditional Databases
- Hive vs. Pig
- Hive Use Cases
- Interacting with Hive
10. Relational Data Analysis with Hive
- Hive Databases and Tables
- Basic HiveQL Syntax
- Data Types
- Joining Data Sets
- Common Built-in Functions
- Hands-On Exercise: Running Hive Queries from the Shell, Scripts, and Hue
11. Hive Data Management
- Hive Data Formats
- Creating Databases and Hive-Managed Tables
- Loading Data into Hive
- Altering Databases and Tables
- Self-Managed Tables
- Simplifying Queries with Views
- Storing Query Results
- Controlling Access to Data
- Hands-On Exercise: Data Management with Hive
12. Text Processing with Hive
- Overview of Text Processing
- Important String Functions
- Using Regular Expressions in Hive
- Sentiment Analysis and N-Grams
- Hands-On Exercise (Optional): Gaining Insight with Sentiment Analysis
13. Hive Optimization
- Understanding Query Performance
- Controlling the Job Execution Plan
- Partitioning
- Bucketing
- Indexing Data
14. Extending Hive
- SerDes
- Data Transformation with Custom Scripts
- User-Defined Functions
- Parameterized Queries
- Hands-On Exercise: Data Transformation with Hive
15. Introduction to Impala
- What Is Impala?
- How Impala Differs from Hive and Pig
- How Impala Differs from Relational Databases
- Limitations and Future Directions
- Using the Impala Shell
16. Analyzing Data with Impala
- Basic Syntax
- Data Types
- Filtering, Sorting, and Limiting Results
- Joining and Grouping Data
- Improving Impala Performance
- Hands-On Exercise: Interactive Analysis with Impala
17. Choosing the Best Tool for the Job
- Comparing MapReduce, Pig, Hive, Impala, and Relational Databases
- Which to Choose?