COURSE 4: PERFORM DATA SCIENCE WITH AZURE DATABRICKS
Module 4: Get Started With Databricks And Machine Learning
MICROSOFT AZURE DATA SCIENTIST ASSOCIATE (DP-100) PROFESSIONAL CERTIFICATE
Complete Coursera Study Guide
Last updated:
INTRODUCTION – Get Started With Databricks And Machine Learning
In this module, you will delve into using PySpark’s machine learning package to construct essential elements of machine learning workflows. This includes performing exploratory data analysis to uncover insights and patterns in your data, training models to predict outcomes based on your data, and evaluating these models to ensure their accuracy and effectiveness.
Additionally, you will learn how to create pipelines for common data featurization tasks, streamlining the process of preparing data for machine learning models. By mastering these skills, you will be well-equipped to handle the end-to-end process of developing and implementing machine learning solutions using PySpark.
Learning Objectives
- Use Azure Databricks and PySpark library to implement key parts of the machine learning workflow
- Build pipelines for common data featurization tasks
PRACTICE QUIZ: KNOWLEDGE CHECK 1
1. What are the two main types of Machine Learning problems?
Select all that apply.
- Regression
- Classification
- Unsupervised learning (CORRECT)
- Supervised learning (CORRECT)
Correct: In unsupervised learning, the data points aren’t labeled—the algorithm labels them for you by organizing the data or describing its structure. This technique is useful when you don’t know what the outcome should look like.
Correct: In supervised learning, algorithms make predictions based on a set of labeled examples that you provide. This technique is useful when you know what the outcome should look like.
2. You are tasked with using Machine Learning to develop an intelligent app that can predict real estate prices.
The dataset you’re using contains input features and the output variable.
Which type of Machine learning problem is this?
- Supervised (CORRECT)
- Unsupervised
- Semi-supervised
Correct: Supervised learning looks to predict the value of some outcome based on one or more input measures. You would use a regression algorithm to predict the label, or the price, in this scenario.
3. Which type of operation does VectorAssembler perform on the features of the model?
- Load
- Estimate
- Transform (CORRECT)
- Extract
Correct: VectorAssembler is a transformer, which implements a .transform() method.
4. Which are the types of variables that can be found in Machine Learning? Select all that apply.
- Overestimated
- Underestimated
- Quantitative (CORRECT)
- Qualitative (CORRECT)
Correct: Quantitative values are numeric and generally unbounded, taking any positive or negative value.
Correct: Qualitative values take on a set number of classes or categories.
5. Which are some examples of quantitative variables? Select all that apply.
- Gender
- State of residence
- Age (CORRECT)
- Salary (CORRECT)
Correct: This is an example of a quantitative variable.
Correct: This is an example of a quantitative variable.
PRACTICE QUIZ: KNOWLEDGE CHECK 2
1. What are the three main building blocks that form the machine learning process in Spark, from featurization to model training and deployment? Select all that apply.
- Loader
- Extractor
- Transformer (CORRECT)
- Estimator (CORRECT)
- Pipelines (CORRECT)
Correct: This is one of the main abstractions used in Spark.
Correct: This is one of the main abstractions used in Spark.
Correct: This is one of the main abstractions used in Spark.
2. From Spark’s machine learning library, MLlib, which one of the following abstractions takes a DataFrame as input and returns a new DataFrame with one or more columns appended to it?
- Estimator
- Transformer (CORRECT)
- Pipeline
Correct: Transformers achieve this by implementing a .transform() method.
3. True or false?
Random forest models also need one-hot encoding.
- True
- False (CORRECT)
Correct: Certain models, such as random forest, do not need one-hot encoding (and can actually be negatively affected by the process).
4. When dealing with null values, which strategy can you implement if you want to see missing data later on without violating the schema?
- Adding a placeholder (CORRECT)
- Dropping the records
- Advanced imputation
- Basic imputation
Correct: This will allow you to see missing data later without violating a schema.
5. When working with regression models, if the p-value of your model coefficient is <0.05 between the input feature and the predicted output, what does that mean? Select all that apply.
- There is more than 95% probability of seeing the correlation by chance.
- There is a 95% probability of seeing the correlation by chance.
- There is a 5% probability of seeing the correlation by chance.
- There is less than 5% chance of seeing the correlation by chance. (CORRECT)
Correct: This is the correct interpretation of a p-value below 0.05.
QUIZ: TEST PREP
1. What are qualitative variables also known as?
Select all that apply.
- Numerical
- Continuous
- Discrete (CORRECT)
- Categorical (CORRECT)
Correct: This is one of the ways qualitative variables are also known.
Correct: This is one of the ways qualitative variables are also known.
2. Which type of supervised learning problem tends to output quantitative values?
- Regression (CORRECT)
- Classification
- Clustering
Correct: This would be the algorithm used because you would predict a label based on numerical values.
3. In the process of exploratory data analysis, when calculating the number of observations in the dataset, which of the following statistics will tell us if there are missing values in the dataset?
- Mean
- Count (CORRECT)
- Standard deviation
Correct: Count gives us the number of observed values, indicating the size of the dataset and whether there are missing values.
4. In terms of correlations, what does a negative correlation of -1 mean?
- There is no association between the variables.
- For each unit increase in one variable, the same decrease is seen in the other. (CORRECT)
- For each unit increase in one variable, the same increase is seen in the other.
Correct: This is what a negative correlation of -1 indicates.
5. Regarding visualization tools, which of the following can help you visualize quantiles and outliers?
- Box plots (CORRECT)
- Heat maps
- Q-Q plots
- t-SNE
Correct: A box plot is a chart used to visualize how a given variable is distributed using quartiles. It shows the minimum, maximum, median, and first and third quartiles of the data set, making quantiles and outliers easy to spot.
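As a numeric companion to the box-plot answer, the five values a box plot draws (minimum, Q1, median, Q3, maximum) can be computed with the standard library; the data set below is made up, with 30 as an obvious outlier:

```python
# Compute the five-number summary a box plot visualizes, using the stdlib;
# the data here is illustrative, with 30 as an obvious outlier.
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
q1, q2, q3 = quantiles(data, n=4)  # the three quartile cut points
five_number_summary = (min(data), q1, q2, q3, max(data))
```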
6. You have an AirBnB dataset where one categorical variable is room type.
There are three types of rooms: private room, entire home/apt, and shared room.
You must first encode each unique string into a number so that the machine learning model knows how to handle these room types.
How should you code that?
- from pyspark.ml.feature import StringIndexer
- uniqueTypesDF = airbnbDF.select("room_type").distinct()
- indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index")
- indexerModel = indexer.transform(uniqueTypesDF)
- indexedDF = indexerModel.transform(uniqueTypesDF)
- display(indexedDF)
- from pyspark.ml.feature import StringIndexer
- uniqueTypesDF = airbnbDF.select("room_type").distinct()
- indexer = StringIndexer(inputCol="room_type")
- indexerModel = indexer.fit(uniqueTypesDF)
- indexedDF = indexerModel.transform(uniqueTypesDF)
- display(indexedDF)
- from pyspark.ml.feature import Indexer
- uniqueTypesDF = airbnbDF.select("room_type").distinct()
- indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index") indexerModel = indexer.fit(uniqueTypesDF)
- indexedDF = indexerModel.transform(uniqueTypesDF)
- display(indexedDF)
- from pyspark.ml.feature import StringIndexer (CORRECT)
- uniqueTypesDF = airbnbDF.select("room_type").distinct()
- indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index")
- indexerModel = indexer.fit(uniqueTypesDF)
- indexedDF = indexerModel.transform(uniqueTypesDF)
- display(indexedDF)
Correct: This is the correct code.
7. You have an AirBnB dataset where one categorical variable is room type.
There are three types of rooms: private room, entire home/apt, and shared room.
After you’ve encoded each unique string into a number, each room has a unique numerical value assigned.
Now you must one-hot encode each of those values to a location in an array, so that the machine learning algorithm can effectively process each category.
How should you code that?
- from pyspark.ml.feature import OneHotEncoder
- encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
- encoderModel = encoder.fit(indexedDF)
- encodedDF = encoderModel_transform()
- display(encodedDF)
- from pyspark.ml.feature import OneHotEncoder (CORRECT)
- encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
- encoderModel = encoder.fit(indexedDF)
- encodedDF = encoderModel.fit (indexedDF
- display(encodedDF)
- from pyspark.ml.feature import OneHotEncoder
- encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
- encoderModel = encoder.fit(indexedDF)
- encodedDF = encoderModel(indexedDF)
- display(encodedDF)
- from pyspark.ml.feature import OneHotEncoder
- encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
- encoderModel = encoder.fit(indexedDF)
- encodedDF = encoderModel.transform(indexedDF)
- display(encodedDF)
Correct: This is the correct code. You need to change these values to a binary yes/no indication of whether a listing is a shared room, entire home, or private room.
Do this by fitting the OneHotEncoder, which only operates on numerical values (this is why StringIndexer was needed first).
CONCLUSION – Get Started With Databricks And Machine Learning
In conclusion, this module has provided you with a comprehensive understanding of using PySpark’s machine learning package to build critical components of machine learning workflows. You have learned to perform exploratory data analysis, train models, and evaluate their performance, ensuring accurate and effective predictions.
Additionally, you have mastered the creation of pipelines for common data featurization tasks, streamlining the preparation of data for machine learning models. With these skills, you are well-prepared to develop and implement robust machine learning solutions using PySpark.
Quiztudy Top Courses
Popular in Coursera
- Google Advanced Data Analytics
- Google Cybersecurity Professional Certificate
- Meta Marketing Analytics Professional Certificate
- Google Digital Marketing & E-commerce Professional Certificate
- Google UX Design Professional Certificate
- Meta Social Media Marketing Professional Certificate
- Google Project Management Professional Certificate
- Meta Front-End Developer Professional Certificate
Liking our content? Then don’t forget to add us to your BOOKMARKS so you can find us easily!

