COURSE 3: BUILD AND OPERATE MACHINE LEARNING SOLUTIONS WITH AZURE

Module 2: Work With Data And Compute In Azure Machine Learning 

MICROSOFT AZURE DATA SCIENTIST ASSOCIATE (DP-100) PROFESSIONAL CERTIFICATE

Complete Coursera Study Guide

Last updated:

INTRODUCTION – Work With Data And Compute In Azure Machine Learning 

Data is the foundation of machine learning, providing the essential inputs for building accurate models. In this module, you will learn to work with datastores and datasets in Azure Machine Learning, enabling you to construct scalable, cloud-based model training solutions.

Additionally, you will explore how to use cloud compute resources in Azure Machine Learning to run training experiments at scale. By the end of this module, you will have the skills to integrate data management and cloud computing in Azure Machine Learning, building robust and scalable machine learning solutions.

Learning Objectives

  • Create and use datastores.
  • Create and use datasets.
  • Create and use environments.
  • Create and use compute targets.

PRACTICE QUIZ: KNOWLEDGE CHECK

1. When planning for datastores, which data file format generally performs better?

  • XLS
  • XML
  • CSV
  • Parquet (CORRECT)

Correct: When working with data files, although CSV format is very common, Parquet format generally results in better performance.
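For reference, a minimal sketch of creating a tabular dataset directly from Parquet files. The workspace object ws and the folder path are hypothetical assumptions, not part of the quiz:

    from azureml.core import Dataset, Workspace

    ws = Workspace.from_config()  # load the workspace from a local config.json
    blob_ds = ws.get_default_datastore()

    # Read all Parquet files under a (hypothetical) data/files folder as a table
    parquet_ds = Dataset.Tabular.from_parquet_files(path=(blob_ds, 'data/files/*.parquet'))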

2. True or False?

You cannot access datastores by name.

  • True
  • False (CORRECT)

Correct: You can access any datastore by name, but you may want to consider changing the default datastore (which is initially the built-in workspaceblobstore datastore).
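For reference, a minimal sketch of retrieving a datastore by name (the name blob_data is a hypothetical placeholder; ws is an existing Workspace object):

    from azureml.core import Datastore

    # Get a specific datastore by name
    blob_store = Datastore.get(ws, datastore_name='blob_data')

    # Or get the current default datastore (initially workspaceblobstore)
    default_store = ws.get_default_datastore()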

3. If you want to change the default datastore, what method should you use?

  • change_default_datastore()
  • new_default_datastore()
  • set_default_datastore() (CORRECT)
  • modify_default_datastore()

Correct: You’ll use this method to change the default datastore.
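As a quick illustration, set_default_datastore() is called on the workspace object (the datastore name is hypothetical):

    # Make the (hypothetical) datastore named 'blob_data' the workspace default
    ws.set_default_datastore('blob_data')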

4. What types of datasets can be created in Azure Machine Learning? Select all that apply.

  • Notebook
  • Media
  • File (CORRECT)
  • Tabular (CORRECT)

Correct (File): A file dataset presents a list of file paths that can be read as though from the file system.

Correct (Tabular): The data is read from a tabular dataset as a table.

5. To create a tabular dataset using the SDK, which method of the Dataset.Tabular class should you use?

  • from_tabular_dataset
  • from_tabular_files
  • from_delimited_files (CORRECT)
  • from_files_tabular

Correct: Use this method to create a tabular dataset from the Dataset.Tabular class.
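For reference, a minimal sketch of the method (the CSV path is a hypothetical placeholder; ws is an existing Workspace object):

    from azureml.core import Dataset

    blob_ds = ws.get_default_datastore()

    # Create a tabular dataset from one or more delimited-file paths on the datastore
    tab_ds = Dataset.Tabular.from_delimited_files(path=(blob_ds, 'data/files/*.csv'))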

PRACTICE QUIZ: KNOWLEDGE CHECK

1. Which package managers are usually used in the installation of a Python virtual environment? Select all that apply.

  • pandas
  • numpy
  • pip (CORRECT)
  • conda (CORRECT)

Correct: pip is one of the package managers commonly used to set up a virtual environment for a Python runtime.

Correct: conda is one of the package managers commonly used to set up a virtual environment for a Python runtime.

2. You saved a specification file named conda.yml and you want to use it to create an Azure ML environment.

Which SDK command does the job?

  • from azureml.core import Environment
    env = Environment.from_conda_specification(name='training_environment',
                                               file_path='conda.yml')
  • from azureml.core import Environment
    env = Environment.from_conda_specification(name='training_environment',
                                               file_path='/conda.yml')
  • from azureml.core import Environment
    env = Environment.from_conda_specification(name='training_environment',
                                               file_path='*conda.yml')
  • from azureml.core import Environment (CORRECT)
    env = Environment.from_conda_specification(name='training_environment',
                                               file_path='./conda.yml')

Correct: This is the correct command for this job.
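Once created, the environment (env, from the answer above) can be registered in the workspace for reuse, for example:

    # Register the environment so future runs can reuse it
    env.register(workspace=ws)

    # Later, retrieve the registered environment by name
    from azureml.core import Environment
    training_env = Environment.get(workspace=ws, name='training_environment')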

3. You want to create an Azure ML environment by specifying the packages you need.

Which SDK commands do the job?

  • from azureml.core import Environment (CORRECT)
    from azureml.core.conda_dependencies import CondaDependencies
    env = Environment('training_environment')
    deps = CondaDependencies.create(conda_packages=['scikit-learn','pandas','numpy'],
                                    pip_packages=['azureml-defaults'])
    env.python.conda_dependencies = deps
  • from azureml.core import Environment
    from azureml.core.conda_dependencies import CondaDependencies
    env = Environment('training_environment')
    deps = CondaDependencies.deploy(conda_packages=['scikit-learn','pandas','numpy'],
                                    pip_packages=['azureml-defaults'])
    env.python.conda_dependencies = deps
  • from azureml.core import Environment
    env = Environment('training_environment')
    deps = CondaDependencies.create(conda_packages=['scikit-learn','pandas','numpy'],
                                    pip_packages=['azureml-defaults'])
    env.python.conda_dependencies = deps
  • from azureml.core import Environment
    from azureml.core.conda_dependencies import Conda
    env = Environment('training_environment')
    deps = CondaDependencies.create(conda_packages=['scikit-learn','pandas','numpy'],
                                    pip_packages=['azureml-defaults'])
    env.python.conda_dependencies = deps

Correct: These are the correct commands. You can define an environment by specifying the Conda and pip packages you need in a CondaDependencies object.

4. If you are running a notebook experiment on an Azure Machine Learning compute instance, what type of compute are you using?

  • Attached compute
  • Compute clusters
  • Local compute (CORRECT)

Correct: This runs the experiment on the same compute target as the code used to initiate the experiment, which may be your physical workstation or a virtual machine such as an Azure Machine Learning compute instance on which you are running a notebook.

5. If you have an Azure Databricks cluster that you want to use for running experiments and training models, which type of compute target is this?

  • Managed
  • Unmanaged (CORRECT)

Correct: An unmanaged compute target is one that is defined and managed outside of the Azure Machine Learning workspace; for example, an Azure virtual machine or an Azure Databricks cluster.
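For reference, a minimal sketch of attaching an existing Azure Databricks cluster as an unmanaged compute target (the resource group, workspace name, and token are hypothetical placeholders; ws is an existing Workspace object):

    from azureml.core.compute import ComputeTarget, DatabricksCompute

    # Attach configuration for an existing Databricks workspace (values hypothetical)
    attach_config = DatabricksCompute.attach_configuration(resource_group='my_rg',
                                                           workspace_name='my_databricks_ws',
                                                           access_token='<access-token>')

    # Attach the cluster to the Azure ML workspace under a chosen name
    databricks_compute = ComputeTarget.attach(ws, 'db_cluster', attach_config)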

QUIZ: START PREP

1. Which Python commands should you use to create and register a tabular dataset using the from_delimited_files method of the Dataset.Tabular class?

  • from azureml.core import Dataset
    blob_ds = ws.get_default_datastore()
    csv_paths = [(blob_ds, 'data/files/current_data.csv'),
                 (blob_ds, 'data/files/archive/*.csv')]
    tab_ds = Dataset.Tabular.from_delimited_files()
    tab_ds = tab_ds.register(workspace=ws, name='csv_table')
  • from azureml.core import Dataset (CORRECT)
    blob_ds = ws.get_default_datastore()
    csv_paths = [(blob_ds, 'data/files/current_data.csv'),
                 (blob_ds, 'data/files/archive/*.csv')]
    tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
    tab_ds = tab_ds.register(workspace=ws, name='csv_table')
  • from azureml.core import Dataset
    blob_ds = ws.get_default_datastore()
    csv_paths = [(blob_ds, 'data/files/current_data.csv'),
                 (blob_ds, 'data/files/archive/csv')]
    tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
    tab_ds = tab_ds.register(workspace=ws, name='csv_table')
  • from azureml.core import Dataset
    blob_ds = ws.change_default_datastore()
    csv_paths = [(blob_ds, 'data/files/current_data.csv'),
                 (blob_ds, 'data/files/archive/*.csv')]
    tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
    tab_ds = tab_ds.register(workspace=ws, name='csv_table')

Correct: These are the correct commands for this request: the default datastore is retrieved with get_default_datastore(), and from_delimited_files() is given the list of paths before the dataset is registered.
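After registering, the tabular dataset (tab_ds, from the answer above) can be loaded into memory, for example:

    # Read the registered tabular dataset into a pandas DataFrame
    df = tab_ds.to_pandas_dataframe()
    print(df.head())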

2. You’re creating a file dataset using the from_files method of the Dataset.File class.

You also want to register it in the workspace with the name img_files.

Which SDK commands can you use?

  • from azureml.core import Dataset (CORRECT)
    blob_ds = ws.get_default_datastore()
    file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
    file_ds = file_ds.register(workspace=ws, name='img_files')
  • from azureml.core import Dataset
    blob_ds = ws.get_default_datastore()
    file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
  • from azureml.core import Dataset
    file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
    file_ds = file_ds.register(workspace=ws, name='img_files')
  • from azureml.core import Dataset
    blob_ds = ws.get_default_datastore()
    file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images'))
    file_ds = file_ds.register(workspace=ws, name='img_files')

Correct: This is the correct and complete command for this scenario.
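A registered file dataset can then be consumed by enumerating the file paths it resolves to, for example:

    # List the file paths contained in the file dataset
    for file_path in file_ds.to_path():
        print(file_path)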

3. What methods can you use from the Dataset class to retrieve a dataset after registering it? Select all that apply.

  • find_by_id
  • find_by_name
  • get_by_name (CORRECT)
  • get_by_id (CORRECT)

Correct: get_by_name retrieves a dataset using its name.

Correct: get_by_id retrieves a dataset using its ID.
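For reference, a minimal sketch of both retrieval methods (the dataset name matches the one registered above; ws is an existing Workspace object):

    from azureml.core import Dataset

    # Retrieve a registered dataset by name
    img_ds = Dataset.get_by_name(workspace=ws, name='img_files')

    # Or retrieve the same dataset by its ID
    same_ds = Dataset.get_by_id(workspace=ws, id=img_ds.id)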

5. To retrieve a specific version of a dataset, which SDK command should you use?

  • img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version(2))
  • img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version='2')
  • img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version_2)
  • img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2) (CORRECT)

Correct: This is the correct command for this request.
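New versions are created at registration time, for example with the create_new_version flag:

    # Register an updated dataset as a new version under the same name
    img_ds = file_ds.register(workspace=ws, name='img_files', create_new_version=True)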

5. Which SDK commands can you use to view the registered environments in your workspace?

  • from azureml.core import Environment (CORRECT)
    env_names = Environment.list(workspace=ws)
    for env_name in env_names:
        print('Name:', env_name)
  • from azureml.core import Environment
    env_names = Environment.list(workspace=ws)
    for each env_name in env_names:
        print('Name:', env_name)
  • from azureml.core import Environment
    env_names = Environment_list(workspace=ws)
    for env_name in env_names:
        print('Name:', env_name)
  • from azureml.core import Environment
    env_names = Environment.list(workspace=ws)
    for env_name of env_names:
        print('Name:', env_name)

Correct: These commands will show you the registered environments in your workspace.

6. You are defining a compute configuration for a managed compute target using the SDK.

Which of the commands below is correct?

  • compute_config = AmlCompute_provisioning_configuration(vm_size='STANDARD_DS11_V2',
                                                           min_nodes=0, max_nodes=4,
                                                           vm_priority='dedicated')
  • compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', (CORRECT)
                                                           min_nodes=0, max_nodes=4,
                                                           vm_priority='dedicated')
  • compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2',
                                                           min_nodes=0, max_nodes=0,
                                                           vm_priority='dedicated')
  • compute_config = AmlCompute.provisioning.configuration(vm_size='STANDARD_DS11_V2',
                                                           min_nodes=0, max_nodes=4,
                                                           vm_priority='dedicated')

Correct: These are the correct commands for the job.
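To actually provision the managed cluster, the configuration (compute_config, from the answer above) is passed to ComputeTarget.create, for example:

    from azureml.core.compute import ComputeTarget, AmlCompute

    # Provision the cluster in the workspace under a chosen name
    training_cluster = ComputeTarget.create(ws, 'aml-cluster', compute_config)
    training_cluster.wait_for_completion(show_output=True)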

7. You created a compute target and now you want to use it for an experiment. You want to specify the compute target using a ComputeTarget object.

Which of the SDK commands below can you use?

  • compute_name = "aml-cluster"
    training_env = Environment.get(workspace=ws, name='training_environment')
    script_config = ScriptRunConfig(source_directory='my_dir',
                                    script='script.py',
                                    environment=training_env,
                                    compute_target=training_cluster)
  • compute_name = "aml-cluster"
    training_cluster = ComputeTarget(workspace=ws)
    training_env = Environment.get(workspace=ws, name='training_environment')
    script_config = ScriptRunConfig(source_directory='my_dir',
                                    script='script.py',
                                    environment=training_env,
                                    compute_target=training_cluster)
  • compute_name = "aml-cluster" (CORRECT)
    training_cluster = ComputeTarget(workspace=ws, name=compute_name)
    training_env = Environment.get(workspace=ws, name='training_environment')
    script_config = ScriptRunConfig(source_directory='my_dir',
                                    script='script.py',
                                    environment=training_env,
                                    compute_target=training_cluster)

Correct: These are the correct commands for the task at hand.
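The run configuration (script_config, from the answer above) can then be submitted as an experiment, for example:

    from azureml.core import Experiment

    # Submit the script run to the specified compute target
    experiment = Experiment(workspace=ws, name='training-experiment')
    run = experiment.submit(config=script_config)
    run.wait_for_completion(show_output=True)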

8. Azure Machine Learning supports the creation of datastores for multiple kinds of Azure data source. Which of the following are supported? Select all that apply.

  • Azure Database for PostgreSQL
  • Azure Databricks (CORRECT)
  • Azure Data Lake stores (CORRECT)

Correct: Azure Databricks is a supported datastore.

Correct: Azure Data Lake stores are supported datastores.

9. True or false? 

To add a datastore to your workspace, you can only register it using the graphical interface in Azure Machine Learning Studio.

  • True
  • False (CORRECT)

Correct: You can view and manage datastores in Azure Machine Learning Studio, or you can register them using the Azure Machine Learning SDK.
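For reference, a minimal sketch of registering a blob container as a datastore with the SDK (the account, container, and key values are hypothetical placeholders; ws is an existing Workspace object):

    from azureml.core import Datastore

    # Register an Azure blob container as a datastore (values hypothetical)
    blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                      datastore_name='blob_data',
                                                      container_name='data_container',
                                                      account_name='az_store_acct',
                                                      account_key='<account-key>')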

10. Environments are commonly created in Docker containers that are in turn hosted in compute targets. Which of the following are examples of compute targets? Select all that apply.

  • Azure blob storage
  • Cloud clusters (CORRECT)
  • Virtual machines (CORRECT)

Correct: Clusters in the cloud are an example of a compute target that can be used in environment creation.

Correct: Virtual machines are an example of a compute target that can be used in environment creation.

11. True or false?

Local compute is generally a great choice during development and testing with low to moderate volumes of data.

  • True (CORRECT)
  • False

Correct: Local compute is generally a great choice during development and testing with low to moderate volumes of data. 

CONCLUSION – Work With Data And Compute In Azure Machine Learning 

In conclusion, data is fundamental to machine learning, serving as the core input for model development. Throughout this module, you have learned to effectively work with datastores and datasets in Azure Machine Learning, enabling you to create scalable, cloud-based training solutions. You have also explored the use of cloud compute resources to run training experiments at scale. With these skills, you are now equipped to integrate data management and cloud computing in Azure Machine Learning, allowing you to build robust, scalable machine learning models with confidence.