Module 1: Translating business challenges into ML use cases
- Choosing the best solution (ML vs. non-ML, custom vs. pre-packaged)
- Defining how the model output should be used to solve the business problem
- Deciding how incorrect results should be handled
- Identifying data sources
Module 2: Defining ML problems
- Problem type
- Outcome of model predictions
- Input (features) and predicted output format
Module 3: Defining business success criteria
- Alignment of ML success metrics to the business problem
- Key results
- Determining when a model is deemed unsuccessful
Module 4: Identifying risks to feasibility of ML solutions
- Assessing and communicating business impact
- Assessing ML solution readiness
- Assessing data readiness and potential limitations
- Aligning with Google's Responsible AI practices
Module 5: Designing reliable, scalable, and highly available ML solutions
- Choosing appropriate ML services for the use case
- Component types:
  - Exploration/analysis
  - Feature engineering
  - Logging/management
  - Automation
  - Orchestration
  - Monitoring
  - Serving
Module 6: Choosing appropriate Google Cloud hardware components
- Evaluation of compute and accelerator options
Module 7: Designing architecture that complies with security requirements across sectors/industries
- Building secure ML systems
- Privacy implications of data usage and/or collection
Module 8: Exploring data (EDA)
- Visualization
- Statistical fundamentals at scale
- Evaluation of data quality and feasibility
- Establishing data constraints
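A minimal EDA sketch for this module, using pandas on a sampled extract; the file path and the `transaction_amount` column are placeholders, and at production scale the same profiling would typically run in BigQuery or Dataflow rather than in memory:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical CSV extract used for illustration; swap in your own table or query result.
df = pd.read_csv("training_data.csv")

# Summary statistics and schema overview.
print(df.describe(include="all").transpose())
print(df.dtypes)

# Data-quality signals: missing-value rates and duplicate rows.
missing_rate = df.isna().mean().sort_values(ascending=False)
print(missing_rate[missing_rate > 0])
print(f"duplicate rows: {df.duplicated().sum()}")

# Quick visual check of a numeric column's distribution (column name is a placeholder).
df["transaction_amount"].plot(kind="hist", bins=50, title="transaction_amount")
plt.show()
```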
Module 9: Building data pipelines
- Organizing and optimizing training datasets
- Data validation
- Handling missing data (sketched below, together with outlier handling)
- Handling outliers
- Data leakage
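A sketch of the missing-data and outlier items above, assuming a pandas DataFrame with placeholder column names. To avoid the data-leakage item in the same breath: the imputation and clipping statistics should be computed on the training split only and reused unchanged at validation and serving time.

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # placeholder path

# Missing data: impute numeric columns with the median and keep a flag that the value
# was missing, so the model can still learn from "missingness" itself.
for col in ["income", "age"]:  # placeholder column names
    df[f"{col}_was_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())

# Outliers: winsorize (clip) extreme values to the 1st/99th percentiles
# instead of dropping rows.
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(lower=low, upper=high)
```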
Module 10: Creating input features (feature engineering)
- Ensuring consistent data pre-processing between training and serving
- Encoding structured data types
- Feature selection
- Class imbalance
- Feature crosses
- Transformations (TensorFlow Transform)
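Since the last item names TensorFlow Transform, here is a minimal `preprocessing_fn` sketch; the feature names are placeholders and dense (FixedLenFeature) inputs are assumed. Because tf.Transform attaches the transform graph to the exported model, the same code gives consistent pre-processing between training and serving, and it also shows a simple hashed feature cross.

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    outputs = {}
    # Scale a numeric feature using statistics computed over the full dataset.
    outputs["amount_scaled"] = tft.scale_to_z_score(inputs["amount"])
    # Encode a string feature as an integer vocabulary index.
    outputs["country_id"] = tft.compute_and_apply_vocabulary(inputs["country"])
    # Feature cross: join two categorical values into one key, then hash-bucket it.
    cross = tf.strings.join([inputs["country"], inputs["device"]], separator="_")
    outputs["country_x_device"] = tf.strings.to_hash_bucket_fast(cross, num_buckets=1000)
    outputs["label"] = inputs["label"]
    return outputs
```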
Module 11: Building models
- Choice of framework and model
- Modeling techniques given interpretability requirements
- Transfer learning (see the sketch after this module)
- Data augmentation
- Semi-supervised learning
- Model generalization and strategies to handle overfitting and underfitting
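A sketch of transfer learning plus data augmentation in Keras; the 160x160 input size and 5-class head are arbitrary placeholders. Freezing the pretrained backbone and adding dropout are the generalization levers this module covers; unfreezing the top layers with a low learning rate is the usual fine-tuning step once the head has converged.

```python
import tensorflow as tf

# Reuse an ImageNet-pretrained MobileNetV2 as a frozen feature extractor.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze pretrained weights

# Lightweight data augmentation, active only during training.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

inputs = tf.keras.Input(shape=(160, 160, 3))
x = augment(inputs)
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.2)(x)  # regularization against overfitting
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```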
Module 12: Training models
- Ingestion of various file types into training
- Training a model as a job in different environments
- Hyperparameter tuning (example below)
- Tracking metrics during training
- Retraining/redeployment evaluation
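One way to sketch the hyperparameter-tuning item is a local KerasTuner search; the search ranges and the synthetic data below are placeholders, and on Google Cloud the equivalent managed option is a hyperparameter tuning job.

```python
import keras_tuner as kt
import numpy as np
import tensorflow as tf

# Synthetic data so the sketch runs end to end.
x_train, y_train = np.random.rand(256, 20), np.random.randint(0, 2, 256)
x_val, y_val = np.random.rand(64, 20), np.random.randint(0, 2, 64)

def build_model(hp):
    # Search over layer width and learning rate (ranges are illustrative).
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    lr = hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
tuner.search(x_train, y_train, epochs=5, validation_data=(x_val, y_val))
best_model = tuner.get_best_models(num_models=1)[0]
```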
Module 13: Testing models
- Unit tests for model training and serving (see sketch below)
- Model performance against baselines, simpler models, and across the time dimension
- Model explainability on AI Platform
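A sketch of unit tests for training and serving behaviour, written for pytest against a toy Keras model; the feature count and assertions are illustrative, not a prescribed test suite.

```python
import numpy as np
import tensorflow as tf

def build_model(num_features: int = 20) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(num_features,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

def test_output_shape_and_range():
    # Serving contract: one probability in [0, 1] per input row.
    model = build_model()
    preds = model.predict(np.random.rand(8, 20))
    assert preds.shape == (8, 1)
    assert np.all((preds >= 0.0) & (preds <= 1.0))

def test_training_reduces_loss():
    # A single tiny batch should be learnable; loss must decrease after a few epochs.
    model = build_model()
    x, y = np.random.rand(32, 20), np.random.randint(0, 2, (32, 1))
    loss_before = model.evaluate(x, y, verbose=0)
    model.fit(x, y, epochs=20, verbose=0)
    loss_after = model.evaluate(x, y, verbose=0)
    assert loss_after < loss_before
```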
Module 14: Scaling model training and serving
- Distributed training
- Scaling prediction service
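A minimal distributed-training sketch with `tf.distribute.MirroredStrategy` (synchronous data parallelism across the GPUs visible on one machine); the model and synthetic data are placeholders. Multi-node training would swap in `MultiWorkerMirroredStrategy` or a managed training job.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

# Model and optimizer variables must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Scale the global batch size with the number of replicas.
x, y = np.random.rand(1024, 20), np.random.rand(1024, 1)
model.fit(x, y, batch_size=64 * strategy.num_replicas_in_sync, epochs=2)
```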
Module 15: Designing and implementing training pipelines
- Identification of components, parameters, triggers, and compute needs
- Orchestration framework
- Hybrid or multicloud strategies
- System design with TFX components/Kubeflow DSL
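A toy pipeline sketch in the Kubeflow Pipelines v2 DSL, assuming the `kfp` SDK is installed; the two components are placeholders for real data-prep and training steps, and the compiled spec could be submitted to Vertex AI Pipelines or another orchestrator.

```python
from kfp import compiler, dsl

@dsl.component
def preprocess(rows: int) -> int:
    # Placeholder step: pretend we filtered some bad rows.
    return rows - 10

@dsl.component
def train(rows: int) -> str:
    # Placeholder step: a real component would launch training and emit a model artifact.
    return f"model trained on {rows} rows"

@dsl.pipeline(name="toy-training-pipeline")
def training_pipeline(rows: int = 1000):
    prep_task = preprocess(rows=rows)
    train(rows=prep_task.output)  # the dependency is inferred from the data edge

# Compile to a pipeline spec that an orchestrator can schedule and run.
compiler.Compiler().compile(pipeline_func=training_pipeline,
                            package_path="toy_training_pipeline.json")
```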
Module 16: Implementing serving pipelines
- Serving (online, batch, caching); an online-serving example follows this module
- Google Cloud serving options
- Testing for target performance
- Configuring trigger and pipeline schedules
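A sketch of the online-serving path against a TensorFlow Serving REST endpoint; the host, port, model name, and feature vectors are placeholders for a locally running server. Batch workloads would go through a batch prediction job instead, and frequently requested entities can be answered from a cache in front of the endpoint.

```python
import requests

# TensorFlow Serving exposes online prediction at /v1/models/<model_name>:predict.
url = "http://localhost:8501/v1/models/churn_model:predict"
payload = {"instances": [[0.3, 1.0, 5.2], [0.9, 0.0, 2.1]]}  # two placeholder feature vectors

# A tight timeout keeps the caller within its online latency budget.
response = requests.post(url, json=payload, timeout=1.0)
response.raise_for_status()
print(response.json()["predictions"])
```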
Module 17: Tracking and auditing metadata
- Organizing and tracking experiments and pipeline runs (see the sketch below)
- Hooking into model and dataset versioning
- Model/dataset lineage
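One option for the experiment-tracking item is the Vertex AI Experiments API in the `google-cloud-aiplatform` SDK; the project, region, experiment, and run names below are placeholders and an initialized Google Cloud project is assumed. Logged parameters and metrics become queryable metadata that can be tied back to dataset and model versions.

```python
from google.cloud import aiplatform

# Placeholders: project, region, and experiment/run names.
aiplatform.init(project="my-project", location="us-central1",
                experiment="churn-experiments")

aiplatform.start_run("run-2024-01-baseline")
aiplatform.log_params({"model": "xgboost", "max_depth": 6, "dataset_version": "v3"})
aiplatform.log_metrics({"val_auc": 0.91, "val_logloss": 0.23})
aiplatform.end_run()
```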
Module 18: Monitoring and troubleshooting ML solutions
- Performance and business quality of ML model predictions
- Logging strategies
- Establishing continuous evaluation metrics (sketched after this module)
- Understanding Google Cloud permissions model
- Identification of appropriate retraining policy
- Common training and serving errors (TensorFlow)
- ML model failure and resulting biases
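A sketch of a continuous-evaluation check: recompute a business-aligned metric on recent predictions once ground truth arrives, and alert (or trigger the retraining policy) when it drops below an agreed threshold. The arrays and the threshold are illustrative; in production the prediction/label join would usually happen in BigQuery over logged requests.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical recent predictions joined with late-arriving ground truth.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.1, 0.35, 0.7])

AUC_ALERT_THRESHOLD = 0.75  # the agreed "model is unsuccessful" line from Module 3

auc = roc_auc_score(y_true, y_score)
print(f"rolling-window AUC: {auc:.3f}")

if auc < AUC_ALERT_THRESHOLD:
    # Hook for the retraining policy: page the team, open a ticket, or trigger
    # the training pipeline defined in Module 15.
    print("ALERT: model quality below threshold; consider retraining")
```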
Module 19: Tuning performance of ML solutions for training and serving in production
- Optimization and simplification of input pipeline for training
- Simplification techniques
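A sketch of input-pipeline optimization with `tf.data`: parallel reads and parsing, caching after the expensive parsing step, and prefetching so the accelerator is never waiting on input. The Cloud Storage path and the feature spec are placeholders.

```python
import tensorflow as tf

feature_spec = {
    "amount": tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_fn(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    return {"amount": example["amount"]}, example["label"]

# Placeholder TFRecord shards on Cloud Storage.
files = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()            # cache after the expensive parse step
    .shuffle(10_000)
    .batch(512)
    .prefetch(tf.data.AUTOTUNE)  # overlap input production with training
)
```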