- 1. Meet Hadoop
- 2. MapReduce
- 3. The Hadoop Distributed File System (HDFS)
- 4. Hadoop I/O
- 5. Developing a MapReduce Application
- 6. How MapReduce Works
- 7. MapReduce Types and Formats
- 8. MapReduce Features
- 9. Setting Up a Hadoop Cluster
- 10. Administering Hadoop
- 11. Pig
- 12. Hive
- 13. HBase
- 14. ZooKeeper
- 15. Sqoop
- 16. Flume
1. Meet Hadoop
- Data
- Data Storage and Analysis
- Comparison with Other Systems
- RDBMS
- Grid Computing
- Volunteer Computing
- A Brief History of Hadoop
- Apache Hadoop and the Hadoop Ecosystem
- Hadoop Releases
2. MapReduce
- A Weather Dataset
- Data Format
- Analyzing the Data with Unix Tools
- Analyzing the Data with Hadoop
- Map and Reduce
- Java MapReduce
- Scaling Out
- Data Flow
- Combiner Functions
- Running a Distributed MapReduce Job
- Hadoop Streaming
- Compiling and Running
3. The Hadoop Distributed File System (HDFS)
- The Design of HDFS
- HDFS Concepts
- Blocks
- Namenodes and Datanodes
- HDFS Federation
- HDFS High-Availability
- The Command-Line Interface
- Basic Filesystem Operations
- Hadoop Filesystems
- Interfaces
- The Java Interface
- Reading Data from a Hadoop URL
- Reading Data Using the FileSystem API
- Writing Data
- Directories
- Querying the Filesystem
- Deleting Data
- Data Flow
- Anatomy of a File Read
- Anatomy of a File Write
- Coherency Model
- Parallel Copying with distcp
- Keeping an HDFS Cluster Balanced
- Hadoop Archives
4. Hadoop I/O
- Data Integrity
- Data Integrity in HDFS
- LocalFileSystem
- ChecksumFileSystem
- Compression
- Codecs
- Compression and Input Splits
- Using Compression in MapReduce
- Serialization
- The Writable Interface
- Writable Classes
- File-Based Data Structures
- SequenceFile
- MapFile
5. Developing a MapReduce Application
- The Configuration API
- Combining Resources
- Variable Expansion
- Configuring the Development Environment
- Managing Configuration
- GenericOptionsParser, Tool, and ToolRunner
- Writing a Unit Test
- Mapper
- Reducer
- Running Locally on Test Data
- Running a Job in a Local Job Runner
- Testing the Driver
- Running on a Cluster
- Packaging
- Launching a Job
- The MapReduce Web UI
- Retrieving the Results
- Debugging a Job
- Hadoop Logs
- Tuning a Job
- Profiling Tasks
- MapReduce Workflows
- Decomposing a Problem into MapReduce Jobs
- JobControl
6. How MapReduce Works
- Anatomy of a MapReduce Job Run
- Classic MapReduce (MapReduce 1)
- Failures
- Failures in Classic MapReduce
- Failures in YARN
- Job Scheduling
- The Capacity Scheduler
- Shuffle and Sort
- The Map Side
- The Reduce Side
- Configuration Tuning
- Task Execution
- The Task Execution Environment
- Speculative Execution
- Output Committers
- Task JVM Reuse
- Skipping Bad Records
7. MapReduce Types and Formats
- MapReduce Types
- The Default MapReduce Job
- Input Formats
- Input Splits and Records
- Text Input
- Binary Input
- Multiple Inputs
- Database Input (and Output)
- Output Formats
- Text Output
- Binary Output
- Multiple Outputs
- Lazy Output
- Database Output
8. MapReduce Features
- Counters
- Built-in Counters
- User-Defined Java Counters
- User-Defined Streaming Counters
- Sorting
- Preparation
- Partial Sort
- Total Sort
- Secondary Sort
- Joins
- Map-Side Joins
- Reduce-Side Joins
- Side Data Distribution
- Using the Job Configuration
- Distributed Cache
- MapReduce Library Classes
9. Setting Up a Hadoop Cluster
- Cluster Specification
- Network Topology
- Cluster Setup and Installation
- Installing Java
- Creating a Hadoop User
- Installing Hadoop
- Testing the Installation
- SSH Configuration
- Hadoop Configuration
- Configuration Management
- Environment Settings
- Important Hadoop Daemon Properties
- Hadoop Daemon Addresses and Ports
- Other Hadoop Properties
- User Account Creation
- YARN Configuration
- Important YARN Daemon Properties
- YARN Daemon Addresses and Ports
- Security
- Kerberos and Hadoop
- Delegation Tokens
- Other Security Enhancements
- Benchmarking a Hadoop Cluster
- Hadoop Benchmarks
- User Jobs
- Hadoop in the Cloud
- Hadoop on Amazon EC2
10. Administering Hadoop
- HDFS
- Persistent Data Structures
- Safe Mode
- Audit Logging
- Tools
- Monitoring
- Logging
- Metrics
- Java Management Extensions
- Routine Administration Procedures
- Commissioning and Decommissioning Nodes
- Upgrades
11. Pig
- Installing and Running Pig
- Execution Types
- Running Pig Programs
- Grunt
- Pig Latin Editors
- An Example
- Generating Examples
- Comparison with Databases
- Pig Latin
- Structure
- Statements
- Expressions
- Types
- Schemas
- Functions
- Macros
- User-Defined Functions
- A Filter UDF
- An Eval UDF
- A Load UDF
- Data Processing Operators
- Loading and Storing Data
- Filtering Data
- Grouping and Joining Data
- Sorting Data
- Combining and Splitting Data
- Pig in Practice
- Parallelism
- Parameter Substitution
12. Hive
- Installing Hive
- The Hive Shell
- An Example
- Running Hive
- Configuring Hive
- Hive Services
- Comparison with Traditional Databases
- Schema on Read Versus Schema on Write
- Updates, Transactions, and Indexes
- HiveQL
- Data Types
- Operators and Functions
- Tables
- Managed Tables and External Tables
- Partitions and Buckets
- Storage Formats
- Importing Data
- Altering Tables
- Dropping Tables
- Querying Data
- Sorting and Aggregating
- MapReduce Scripts
- Joins
- Subqueries
- Views
- User-Defined Functions
- Writing a UDF
- Writing a UDAF
13. HBase
- Backdrop
- Concepts
- Whirlwind Tour of the Data Model
- Implementation
- Installation
- Test Drive
- Clients
- Java
- Avro, REST, and Thrift
- Schemas
- Loading Data
- Web Queries
- HBase Versus RDBMS
- Successful Service
- HBase
14. ZooKeeper
- Installing and Running ZooKeeper
- Group Membership in ZooKeeper
- Creating the Group
- Joining a Group
- Listing Members in a Group
- Deleting a Group
- The ZooKeeper Service
- Data Model
- Operations
- Implementation
- Consistency
- Sessions
- States
15. Sqoop
- Getting Sqoop
- A Sample Import
- Generated Code
- Additional Serialization Systems
- Database Imports: A Deeper Look
- Controlling the Import
- Imports and Consistency
- Direct-mode Imports
- Working with Imported Data
- Imported Data and Hive
- Importing Large Objects
16. Flume
- Introduction
- Overview
- Architecture
- Data Flow Model
- Reliability
- Building Flume
- Getting the Source
- Compile/Test Flume
- Developing Custom Components
- Client
- Client SDK
- RPC Client Interface
- RPC Clients: Avro and Thrift
- Failover Client
- Load Balancing RPC Client
- Embedded Agent
- Transaction Interface
- Sink
- Source
- Channel