COURSE 4: PERFORM DATA SCIENCE WITH AZURE DATABRICKS

Module 5: Manage Machine Learning Lifecycles And Fine Tune Models

MICROSOFT AZURE DATA SCIENTIST ASSOCIATE (DP-100) PROFESSIONAL CERTIFICATE

Complete Coursera Study Guide

Last updated:

INTRODUCTION – Manage Machine Learning Lifecycles And Fine Tune Models

In this module, you will learn how to use MLflow to track machine learning experiments, allowing you to monitor and manage various runs and results effectively. Additionally, you’ll explore modules from Spark’s machine learning library to perform hyperparameter tuning and model selection, ensuring you can optimize your models for the best performance. These skills will enable you to streamline your machine learning workflow and enhance the accuracy of your predictive models.

Learning Objectives

  • Use MLflow to track experiments, log metrics, and compare runs
  • Understand hyperparameter tuning and its role in machine learning
  • Learn how to use modules from PySpark’s machine learning library for hyperparameter tuning and model selection

PRACTICE QUIZ: KNOWLEDGE CHECK 1

1. What are the three core issues MLflow seeks to address? 

Select all that apply.

  • Keeping track of identity 
  • Reproducing code (CORRECT)
  • Keeping track of experiments (CORRECT)
  • The standardization of model packaging and deployment (CORRECT)

Correct: This is one of the issues MLflow is addressing.

Correct: This is one of the issues MLflow is addressing.

Correct: This is one of the issues MLflow addresses: before MLflow, there was no standard way to package and deploy models.

2. What is the MLflow Tracking tool?

  • A logging API (CORRECT)
  • An environment
  • A library
  • A class

Correct: MLflow Tracking is a logging API specific for machine learning and agnostic to libraries and environments that do the training.

3. MLflow Tracking is organized around the concept of runs, which are basically executions of data science code.

Runs are aggregated into which of the following?

  • Dataframe
  • Workflows
  • Experiments (CORRECT)
  • Datasets

Correct: Runs are aggregated into experiments where many runs can be a part of a given experiment and an MLflow server can host many experiments.

4. What information can be recorded for each run? Select all that apply.

  • Variables
  • Artifacts (CORRECT)
  • Metrics (CORRECT)
  • Source (CORRECT)
  • Parameters (CORRECT)

Correct: Arbitrary output files in any format. This can include images, pickled models, and data files.

Correct: Evaluation metrics such as RMSE or Area Under the ROC Curve.

Correct: The code that originally ran the experiment.

Correct: Key-value pairs of input parameters such as the number of trees in a random forest model.

5. Which of the following objects can be used to query past runs programmatically?

  • MlflowFetcher
  • MlflowTracker
  • MlflowQuery
  • MlflowClient (CORRECT)

Correct: The MlflowClient object is the pathway to querying past runs programmatically in order to use the data back in Python.

PRACTICE QUIZ: KNOWLEDGE CHECK 2

1. What will happen to a model that has been trained and evaluated on the same data?

  • Underfitting
  • Well generalized
  • Overfitting (CORRECT)

Correct: Overfitting occurs when the model performs well on data it has already seen but fails to predict anything useful on data it has not already seen. This is the case here.

2. True or false?

A machine learning algorithm can learn hyperparameters from the data itself.

  • True
  • False (CORRECT)

Correct: A hyperparameter is a parameter used in a machine learning algorithm that is set before the learning process begins. In other words, a machine learning algorithm cannot learn hyperparameters from the data itself.

3. Which of the following best describes the process of Hyperparameter tuning?

  • The process of dropping the hyperparameters that do not perform well on the loss function of the model.
  • The process of modifying the hyperparameter until we get the best result on the loss function of the model.
  • The process of choosing the hyperparameter that performs the best on the loss function of the model. (CORRECT)

Correct: Hyperparameter tuning is the process of choosing the optimal hyperparameters for a machine learning algorithm. Each algorithm has different hyperparameters to tune.
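The idea of "choosing the hyperparameter that performs best on the loss function" can be shown with a toy loop. The loss function below is hypothetical, standing in for whatever validation loss your real model produces for each candidate value.

```python
# Toy tuning loop: evaluate each candidate hyperparameter value on a
# (hypothetical) loss function and keep the one with the lowest loss.
def loss(reg_param):
    # Stand-in for a real validation loss; minimized at reg_param = 0.3.
    return (reg_param - 0.3) ** 2 + 1.0

candidates = [0.0, 0.1, 0.3, 1.0]
best = min(candidates, key=loss)
print(best)  # 0.3
```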

4. When training different models with different hyperparameters and evaluating their performance, there is a risk of overfitting by choosing the hyperparameter that happens to perform best on the data found in the dataset.

Which cross-validation technique would be the best fit for solving this problem?

  • Time Series cross-validation
  • Holdout cross-validation
  • K-fold cross-validation (CORRECT)
  • Repeated random subsampling validation

Correct: This would be the best choice because in k-fold cross-validation the original dataset is equally partitioned into k subparts or folds. Out of the k-folds or groups, for each iteration, one group is selected as validation data, and the remaining (k-1) groups are selected as training data.
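The partitioning described above can be sketched in plain Python: split n indices into k folds, then use each fold once as validation while the remaining k-1 folds form the training set. This is an illustrative helper, not part of any library API.

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k folds of (nearly) equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Each iteration: one fold is the validation set, the rest are training data.
splits = []
for val in kfold_indices(10, 5):
    train = [i for i in range(10) if i not in val]
    splits.append((train, val))
```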

5. Which of the following hyperparameter optimization techniques is the process of exhaustively trying every combination of hyperparameters?

  • Random Search
  • Bayesian Search
  • Grid Search (CORRECT)

Correct: Grid search is a method wherein we try every possible combination of the set of hyperparameters. Each combination of hyperparameters represents a machine learning model, so N combinations represent N machine learning models. Through grid search, we identify the model that shows the best performance.
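The exhaustive enumeration that grid search performs is just a Cartesian product of the candidate values. The grid below is hypothetical; in practice each combination would be used to train and evaluate one model.

```python
from itertools import product

# Hypothetical grid: grid search trains one model per combination.
grid = {"max_iter": [10, 50, 100], "reg_param": [0.0, 0.1]}
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 3 x 2 = 6 combinations, hence 6 candidate models
```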

QUIZ: TEST PREP

1. You can query previous runs programmatically by using the MlflowClient object as the pathway.

How would you code that in Python?

  • from mlflow.pipelines import MlflowClient
    client = MlflowClient()
    client.list_experiments()
  • from mlflow.tracking import MlflowClient (CORRECT)
    client = MlflowClient()
    client.list_experiments()
  • from mlflow.tracking import MlflowClient
    client = MlflowClient()
    list.client_experiments()
  • from mlflow.pipelines import MlflowClient
    client = MlflowClient()
    list.experiments()

Correct: This is the correct code syntax for this job.

2. You can also use the search_runs method to find all runs for a given experiment.

How would you code that in Python?

  • experiment_id = run.experiment_id
    runs_df = mlflow.search_runs(experiment_id)
    display(runs_df)
  • experiment_id = info.experiment_id
    runs_df = mlflow.search_runs(experiment_id)
    display(runs_df)
  • experiment = run.experiment_id
    runs_df = mlflow.search_runs(experiment_id)
    display(runs_df)
  • experiment_id = run.info.experiment_id (CORRECT)
    runs_df = mlflow.search_runs(experiment_id)
    display(runs_df)

Correct: This is the correct code syntax.

3. You need to retrieve the last run from the list of experiments.

How would you code that in Python?

  • runs = client.search_runs(experiment_id, order_by=["attributes.start_time asce"], max_results=1)
    runs[0].data.metrics
  • runs = client.search_runs(experiment_id, order_by=["attributes.start_time"], max_results=1)
    runs[0].data.metrics
  • runs = client.search_runs(experiment_id, order_by=["attributes.start_time desc"], max_results=3)
    runs[0].data.metrics
  • runs = client.search_runs(experiment_id, order_by=["attributes.start_time desc"], max_results=1) (CORRECT)
    runs[0].data.metrics

Correct: This is the correct code syntax.

4. Knowing that each algorithm has different hyperparameters available for tuning, which method can you use to explore the hyperparameters on a model?

  • exploreParams()
  • showParams()
  • explainParams() (CORRECT)
  • getParams()

Correct: You can explore these hyperparameters by using the .explainParams() method on a model.

5. Which PySpark method can you use to string together all the different possible hyperparameters you want to test?

  • ParamGridBuilder() (CORRECT)
  • ParamSearch()
  • ParamGridSearch()
  • ParamBuilder()

Correct: ParamGridBuilder() allows you to string together all of the different possible hyperparameters you would like to test. In this case, you can test the maximum number of iterations, whether you want to use an intercept with the y axis, and whether you want to standardize your features.

6. Which of the following belong to the exhaustive type of cross-validation techniques?

  • K-fold cross-validation
  • Holdout cross-validation
  • Leave-p-out cross-validation (CORRECT)
  • Leave-one-out cross-validation (CORRECT)

Correct: Leave-p-out cross-validation (LpO CV) is an exhaustive type of cross-validation technique. It involves using p observations as the validation set and the remaining observations as the training set. This is repeated over every way of cutting the original sample into a validation set of p observations and a training set.

Correct: Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1, which makes it an exhaustive type of cross-validation.
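Because leave-p-out enumerates every way of holding out p of the n observations, the number of train/validation splits is the binomial coefficient C(n, p). A short sketch with hypothetical values of n and p:

```python
from itertools import combinations

# Leave-p-out: every size-p subset of the n observations is a validation set.
n, p = 5, 2
validation_sets = list(combinations(range(n), p))
print(len(validation_sets))  # C(5, 2) = 10 splits

# Leave-one-out is the special case p = 1, giving exactly n splits.
loocv_sets = list(combinations(range(n), 1))
print(len(loocv_sets))  # 5 splits
```

The combinatorial growth of C(n, p) is why exhaustive techniques become impractical for large datasets, motivating the non-exhaustive methods discussed next.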

7. In which of the following non-exhaustive cross-validation techniques do you randomly assign data points to the training set and the test set?

  • K-fold cross-validation
  • Holdout cross-validation (CORRECT)
  • Repeated random sub-sampling validation

Correct: In the holdout method, you randomly assign data points to two sets d0 and d1, usually called the training set and the test set, respectively. The size of each of the sets is arbitrary although typically the test set is smaller than the training set. You then train (build a model) on d0 and test (evaluate its performance) on d1.
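The holdout split described above can be sketched in a few lines of plain Python; the 20% test fraction and fixed seed are arbitrary choices for illustration.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Randomly assign points to a training set d0 and a (smaller) test set d1."""
    rng = random.Random(seed)       # fixed seed for a reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]  # d0 (train), d1 (test)

train, test = holdout_split(list(range(100)))
```

You would then build the model on d0 (train) and evaluate its performance on d1 (test).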

CONCLUSION – Manage Machine Learning Lifecycles And Fine Tune Models

In conclusion, this module has equipped you with the skills to use MLflow for tracking machine learning experiments, ensuring efficient management and monitoring of runs and results. You have also learned how to leverage Spark’s machine learning library for hyperparameter tuning and model selection, enabling you to optimize your models for superior performance. These capabilities will streamline your machine learning workflow and enhance the accuracy and reliability of your predictive models.