COURSE 4: PERFORM DATA SCIENCE WITH AZURE DATABRICKS
Module 4: Get Started With Databricks And Machine Learning
MICROSOFT AZURE DATA SCIENTIST ASSOCIATE (DP-100) PROFESSIONAL CERTIFICATE
Complete Coursera Study Guide
Last updated:
INTRODUCTION – Get Started With Databricks And Machine Learning
In this module, you will delve into using PySpark’s machine learning package to construct essential elements of machine learning workflows. This includes performing exploratory data analysis to uncover insights and patterns in your data, training models to predict outcomes based on your data, and evaluating these models to ensure their accuracy and effectiveness.
Additionally, you will learn how to create pipelines for common data featurization tasks, streamlining the process of preparing data for machine learning models. By mastering these skills, you will be well-equipped to handle the end-to-end process of developing and implementing machine learning solutions using PySpark.
Learning Objectives
- Use Azure Databricks and PySpark library to implement key parts of the machine learning workflow
- Build pipelines for common data featurization tasks
PRACTICE QUIZ: KNOWLEDGE CHECK 1
1. What are the two main types of Machine Learning problems?
Select all that apply.
- Regression
- Classification
- Unsupervised learning (CORRECT)
- Supervised learning (CORRECT)
Correct: In unsupervised learning, the data points aren’t labeled—the algorithm labels them for you by organizing the data or describing its structure. This technique is useful when you don’t know what the outcome should look like.
Correct: In supervised learning, algorithms make predictions based on a set of labeled examples that you provide. This technique is useful when you know what the outcome should look like.
2. You are tasked with using Machine Learning to develop an intelligent app that can predict real estate prices.
The dataset you’re using contains input features and the output variable.
Which type of Machine learning problem is this?
- Supervised (CORRECT)
- Unsupervised
- Semi-supervised
Correct: Supervised learning looks to predict the value of some outcome based on one or more input measures. You would use a regression algorithm to predict the label, or the price, in this scenario.
3. Which type of operation does VectorAssembler perform on the features of the model?
- Load
- Estimate
- Transform (CORRECT)
- Extract
Correct: VectorAssembler is a transformer, which implements a .transform() method.
4. Which are the types of variables that can be found in Machine Learning? Select all that apply.
- Overestimated
- Underestimated
- Quantitative (CORRECT)
- Qualitative (CORRECT)
Correct: Quantitative values are numeric and generally unbounded, taking any positive or negative value.
Correct: Qualitative values take on a set number of classes or categories.
5. Which are some examples of quantitative variables? Select all that apply.
- Gender
- State of residence
- Age (CORRECT)
- Salary (CORRECT)
Correct: This is an example of a quantitative variable.
Correct: This is an example of a quantitative variable.
PRACTICE QUIZ: KNOWLEDGE CHECK 2
1. What are the three main building blocks that form the machine learning process in Spark, from featurization to model training and deployment? Select all that apply.
- Loader
- Extractor
- Transformer (CORRECT)
- Estimator (CORRECT)
- Pipelines (CORRECT)
Correct: This is one of the main abstractions used in Spark.
Correct: This is one of the main abstractions used in Spark.
Correct: This is one of the main abstractions used in Spark.
2. From Spark’s machine learning library, MLlib, which one of the following abstractions takes a DataFrame as input and returns a new DataFrame with one or more columns appended to it?
- Estimator
- Transformer (CORRECT)
- Pipeline
Correct: Transformers achieve this by implementing a .transform() method.
3. True or false?
Random forest models also need one-hot encoding.
- True
- False (CORRECT)
Correct: Certain models, such as random forest, do not need one-hot encoding (and can actually be negatively affected by the process).
4. When dealing with null values, which strategy can you implement if you want to see missing data later on without violating the schema?
- Adding a placeholder (CORRECT)
- Dropping the records
- Advanced imputation
- Basic imputation
Correct: This will allow you to see missing data later without violating a schema.
5. When working with regression models, if the p-value of your model coefficient is <0.05 between the input feature and the predicted output, what does that mean? Select all that apply.
- There is more than 95% probability of seeing the correlation by chance.
- There is a 95% probability of seeing the correlation by chance.
- There is a 5% probability of seeing the correlation by chance.
- There is less than 5% chance of seeing the correlation by chance. (CORRECT)
Correct: This is the correct interpretation of a p-value below 0.05.
QUIZ: TEST PREP
1. What are qualitative variables also known as?
Select all that apply.
- Numerical
- Continuous
- Discrete (CORRECT)
- Categorical (CORRECT)
Correct: This is one of the ways qualitative variables are also known.
Correct: This is one of the ways qualitative variables are also known.
2. Which type of supervised learning problem tends to output quantitative values?
- Regression (CORRECT)
- Classification
- Clustering
Correct: This would be the algorithm used because you would predict a label based on numerical values.
3. In the process of exploratory data analysis, when calculating the number of observations in the dataset, which of the following statistics will tell us if there are missing values in the dataset?
- Mean
- Count (CORRECT)
- Standard deviation
Correct: Count gives us the number of observed values, indicating the size of the dataset and whether there are missing values.
4. In terms of correlations, what does a negative correlation of -1 mean?
- There is no association between the variables.
- For each unit increase in one variable, the same decrease is seen in the other. (CORRECT)
- For each unit increase in one variable, the same increase is seen in the other.
Correct: This is what a negative correlation of -1 indicates.
5. Regarding visualization tools, which of the following can help you visualize quantiles and outliers?
- Box plots (CORRECT)
- Heat maps
- Q-Q plots
- t-SNE
Correct: A box plot is a chart used to visualize how a given variable is distributed using quartiles. It shows the minimum, maximum, median, and first and third quartiles of the data set, making quantiles and outliers easy to spot.
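As a numeric companion to the box-plot answer, the five values a box plot draws (minimum, Q1, median, Q3, maximum) can be computed with the standard library; the data set below is made up, with 30 as an obvious outlier:

```python
# Compute the five-number summary a box plot visualizes, using the stdlib;
# the data here is illustrative, with 30 as an obvious outlier.
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
q1, q2, q3 = quantiles(data, n=4)  # the three quartile cut points
five_number_summary = (min(data), q1, q2, q3, max(data))
```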
6. You have an AirBnB dataset where one categorical variable is room type.
There are three types of rooms: private room, entire home/apt, and shared room.
You must first encode each unique string into a number so that the machine learning model knows how to handle these room types.
How should you code that?
- from pyspark.ml.feature import StringIndexer
- uniqueTypesDF = airbnbDF.select("room_type").distinct()
- indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index")
- indexerModel = indexer.transform(uniqueTypesDF)
- indexedDF = indexerModel.transform(uniqueTypesDF)
- display(indexedDF)
- from pyspark.ml.feature import StringIndexer
- uniqueTypesDF = airbnbDF.select("room_type").distinct()
- indexer = StringIndexer(inputCol="room_type")
- indexerModel = indexer.fit(uniqueTypesDF)
- indexedDF = indexerModel.transform(uniqueTypesDF)
- display(indexedDF)
- from pyspark.ml.feature import Indexer
- uniqueTypesDF = airbnbDF.select("room_type").distinct()
- indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index") indexerModel = indexer.fit(uniqueTypesDF)
- indexedDF = indexerModel.transform(uniqueTypesDF)
- display(indexedDF)
- from pyspark.ml.feature import StringIndexer (CORRECT)
- uniqueTypesDF = airbnbDF.select("room_type").distinct()
- indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index")
- indexerModel = indexer.fit(uniqueTypesDF)
- indexedDF = indexerModel.transform(uniqueTypesDF)
- display(indexedDF)
Correct: This is the correct code.
7. You have an AirBnB dataset where one categorical variable is room type.
There are three types of rooms: private room, entire home/apt, and shared room.
After you’ve encoded each unique string into a number, each room has a unique numerical value assigned.
Now you must one-hot encode each of those values to a location in an array, so that the machine learning algorithm can effectively process each category.
How should you code that?
- from pyspark.ml.feature import OneHotEncoder
- encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
- encoderModel = encoder.fit(indexedDF)
- encodedDF = encoderModel_transform()
- display(encodedDF)
- from pyspark.ml.feature import OneHotEncoder (CORRECT)
- encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
- encoderModel = encoder.fit(indexedDF)
- encodedDF = encoderModel.fit (indexedDF
- display(encodedDF)
- from pyspark.ml.feature import OneHotEncoder
- encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
- encoderModel = encoder.fit(indexedDF)
- encodedDF = encoderModel(indexedDF)
- display(encodedDF)
- from pyspark.ml.feature import OneHotEncoder
- encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
- encoderModel = encoder.fit(indexedDF)
- encodedDF = encoderModel.transform(indexedDF)
- display(encodedDF)
Correct: This is the correct code. You need to change these values to a binary yes/no indication of whether a listing is a shared room, entire home, or private room.
Do this by fitting the OneHotEncoder, which only operates on numerical values (this is why StringIndexer was needed first).
CONCLUSION – Get Started With Databricks And Machine Learning
In conclusion, this module has provided you with a comprehensive understanding of using PySpark’s machine learning package to build critical components of machine learning workflows. You have learned to perform exploratory data analysis, train models, and evaluate their performance, ensuring accurate and effective predictions.
Additionally, you have mastered the creation of pipelines for common data featurization tasks, streamlining the preparation of data for machine learning models. With these skills, you are well-prepared to develop and implement robust machine learning solutions using PySpark.
Quiztudy Top Courses
Popular in Coursera
- Google Advanced Data Analytics
- Google Cybersecurity Professional Certificate
- Meta Marketing Analytics Professional Certificate
- Google Digital Marketing & E-commerce Professional Certificate
- Google UX Design Professional Certificate
- Meta Social Media Marketing Professional Certificate
- Google Project Management Professional Certificate
- Meta Front-End Developer Professional Certificate
Liking our content? Then don’t forget to add us to your BOOKMARKS so you can find us easily!

