COURSE 4: PERFORM DATA SCIENCE WITH AZURE DATABRICKS
Module 1: Introduction To Azure Databricks
MICROSOFT AZURE DATA SCIENTIST ASSOCIATE (DP-100) PROFESSIONAL CERTIFICATE
Complete Coursera Study Guide
INTRODUCTION – Introduction To Azure Databricks
In this module, you will explore the powerful features of Azure Databricks and the Apache Spark notebook for handling large datasets. You will gain an understanding of the Azure Databricks platform and learn to recognize the types of tasks that Apache Spark excels at. Additionally, the module will introduce you to the architecture of an Azure Databricks Spark Cluster and the workings of Spark Jobs, providing a comprehensive overview of how these technologies can be leveraged for efficient data processing.
Learning Objectives
- Describe the capabilities of Azure Databricks and the Apache Spark notebook for processing huge files
- Describe the Azure Databricks platform and identify the types of tasks well-suited for Apache Spark
- Describe the architecture of an Azure Databricks Spark Cluster and Spark Jobs
PRACTICE QUIZ: KNOWLEDGE CHECK 1
1. Apache Spark is a unified processing engine that can analyze big data with which of the following features?
Select all that apply.
- Support for multiple Drivers running in parallel on a cluster
- Graph Processing (CORRECT)
- Real-time stream analysis (CORRECT)
- SQL (CORRECT)
- Machine Learning (CORRECT)
Correct: Spark is a unified processing engine that can analyze big data using graph processing.
Correct: Spark is a unified processing engine that can analyze big data using real-time stream analysis.
Correct: Spark is a unified processing engine that can analyze big data using SQL.
Correct: Spark is a unified processing engine that can analyze big data using machine learning.
2. Which of the following Databricks features are not part of open-source Spark?
Select all that apply.
- MLflow
- Databricks Runtime (CORRECT)
- Databricks Workflows (CORRECT)
- Databricks Workspace (CORRECT)
Correct: Databricks Runtime is not open-source Spark.
Correct: Databricks Workflows is not open-source Spark.
Correct: Databricks Workspace is not open-source Spark.
3. Apache Spark notebooks allow which of the following?
Select all that apply.
- Create new Workspace
- Display graphical visualizations (CORRECT)
- Rendering of formatted text (CORRECT)
- Execution of code (CORRECT)
Correct: A notebook is a collection of cells. These cells can display graphical visualizations.
Correct: A notebook is a collection of cells. These cells can be run to render formatted text.
Correct: A notebook is a collection of cells. These cells are run to execute code.
4. In Azure Databricks, when creating a new notebook, which of the following are the default languages available to select?
Select all that apply.
- Java
- R (CORRECT)
- Scala (CORRECT)
- Python (CORRECT)
- SQL (CORRECT)
Correct: In Azure Databricks when creating a new Notebook, one of the default languages available to select from is R.
Correct: In Azure Databricks when creating a new Notebook, one of the default languages available to select from is Scala.
Correct: In Azure Databricks when creating a new Notebook, one of the default languages available to select from is Python.
Correct: In Azure Databricks when creating a new Notebook, one of the default languages available to select from is SQL.
5. If your notebook is attached to a cluster, you can carry out which of the following from within the notebook?
Select all that apply.
- Delete the cluster
- Attach to another cluster (CORRECT)
- Restart the cluster (CORRECT)
- Detach your notebook from the cluster (CORRECT)
Correct: If your notebook is attached to a cluster, you can attach to another cluster.
Correct: If your notebook is attached to a cluster, you can restart the cluster.
Correct: If your notebook is attached to a cluster, you can detach your notebook from the cluster.
PRACTICE QUIZ: KNOWLEDGE CHECK 2
1. You work with Big Data as a data engineer or a data scientist, and you must process data that has the characteristics often referred to as the “3 Vs of Big Data”. What do the 3 Vs of Big Data stand for?
Select all that apply.
- Variable
- Variety (CORRECT)
- Velocity (CORRECT)
- Volume (CORRECT)
Correct: Variety – Your data types are varied, from structured relational data sets and financial transactions to unstructured data such as chat and SMS messages, IoT devices, images, logs, MRIs, etc.
Correct: High velocity – You require streaming and real-time processing capabilities.
Correct: High volume – You must process an extremely large volume of data and need to scale out your compute accordingly.
2. Spark’s performance is based on parallelism. Which of the following scaling methods is limited by a finite amount of RAM, threads, and CPU speed?
- Horizontal Scaling
- Diagonal Scaling
- Vertical Scaling (CORRECT)
Correct: Scaling vertically is limited by a finite amount of RAM, threads, and CPU speed.
3. In an Apache Spark cluster, jobs are divided into which of the following?
- Executors
- Slots
- Drivers
- Tasks (CORRECT)
Correct: Jobs are subdivided into tasks. The input to a job is partitioned into one or more partitions. These partitions are the unit of work for each slot.
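The feedback above can be sketched with plain arithmetic: each partition becomes one task, each executor core provides one slot, and tasks run in waves until every partition has been processed. This is a minimal sketch; the function name and numbers are illustrative, not Databricks defaults.

```python
import math

def task_waves(num_partitions, num_executors, cores_per_executor):
    # Each partition becomes one task; each executor core provides one slot.
    slots = num_executors * cores_per_executor
    # Slots process tasks in parallel "waves" until all partitions are done.
    return math.ceil(num_partitions / slots)

# Example: 200 partitions on 4 executors with 8 cores each -> 32 slots, 7 waves.
print(task_waves(200, 4, 8))  # 7
```

With more partitions than slots, some slots process several tasks in sequence; with fewer, part of the cluster sits idle.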
4. When creating a new cluster in the Azure Databricks workspace, which of the following is a sequence of steps that happens in the background?
- When an Azure Databricks workspace is deployed, you are allocated a pool of VMs. Creating a cluster draws from this pool.
- Azure Databricks creates a cluster of driver and worker nodes, based on your VM type and size selections. (CORRECT)
- Azure Databricks provisions a dedicated VM (Virtual Machine) that processes all jobs, based on your VM type and size selection.
Correct: At the time of cluster creation, you specify the types and sizes of the virtual machines (VMs) to use for both the Driver and Worker nodes, but Azure Databricks manages all other aspects of the cluster.
5. To parallelize work, the unit of distribution is a Spark Cluster. Every Cluster has a Driver and one or more executors. Work submitted to the Cluster is split into what type of object?
- Arrays
- Stages
- Jobs (CORRECT)
Correct: Each parallelized action is referred to as a Job. The results of each Job are returned to the Driver. Depending on the work required, multiple Jobs may be needed. Each Job is broken down into Stages.
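As a rough illustration of that hierarchy, the decomposition of a single Job can be modeled as a small data structure: one Job containing Stages, each Stage containing Tasks (one per partition). The action name and counts here are hypothetical.

```python
# Hypothetical decomposition of one parallelized action (a "count rows" Job):
# the Job is broken into Stages, and each Stage into Tasks, one per partition.
job = {
    "job": "count rows",
    "stages": [
        {"stage": 0, "tasks": [f"count partition {i}" for i in range(4)]},
        {"stage": 1, "tasks": ["combine partial counts"]},
    ],
}

total_tasks = sum(len(stage["tasks"]) for stage in job["stages"])
print(total_tasks)  # 5 tasks across 2 stages
```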
6. A Spark cluster uses two levels of parallelization. Which of the following are levels of parallelization?
- Job
- Partition
- Slot (CORRECT)
- Executor (CORRECT)
Correct: The second level of parallelization is the Slot – the number of which is determined by the number of cores and CPUs of each node.
Correct: The first level of parallelization is the Executor – a Java virtual machine running on a node, typically one instance per node.
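These two levels can be sketched in plain Python: level one is the set of Executors (one JVM per worker node), and level two is the Slots inside each Executor (one per core). The class names and numbers are illustrative only, not part of any Spark API.

```python
class Executor:
    """Level one: a Java virtual machine running on a worker node."""
    def __init__(self, cores):
        self.cores = cores

    @property
    def slots(self):
        # Level two: one slot per core; each slot runs one task at a time.
        return self.cores

class Cluster:
    """One driver plus one or more executors (typically one per node)."""
    def __init__(self, executors):
        self.executors = executors

    @property
    def total_slots(self):
        return sum(executor.slots for executor in self.executors)

# Four worker nodes, one executor each, 8 cores per executor -> 32 slots.
cluster = Cluster([Executor(cores=8) for _ in range(4)])
print(cluster.total_slots)  # 32
```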
QUIZ: TEST PREP
1. Azure Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs. Which of the following are features of Azure Databricks?
Select all that apply.
- Parallel Cluster Drivers
- High-speed connectors to Azure storage services (CORRECT)
- Caching (CORRECT)
- Auto-scaling and auto-termination (CORRECT)
- Indexing (CORRECT)
Correct: Azure Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs including High-speed connectors to Azure storage services.
Correct: Azure Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs including Caching.
Correct: Azure Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs including auto-scaling and auto-termination.
Correct: Azure Databricks Runtime adds several key capabilities to Apache Spark workloads that can increase performance and reduce costs including Indexing.
2. Apache Spark supports which of the following languages?
Select all that apply.
- ORC
- Python (CORRECT)
- Java (CORRECT)
- Scala (CORRECT)
Correct: Apache Spark supports Python.
Correct: Apache Spark supports Java.
Correct: Apache Spark supports Scala.
3. Which of the following statements are true?
Select all that apply.
- Once created, a notebook can only be connected to the original cluster.
- To use your Azure Databricks notebook to run code, you do not require a cluster.
- You can detach a notebook from a cluster and attach it to another cluster. (CORRECT)
- To use your Azure Databricks notebook to run code, you must attach it to a cluster (CORRECT)
Correct: You can detach your notebook from a cluster and attach it to another, depending on your organization’s requirements.
Correct: To use your notebook to run code, you must attach it to a cluster.
4. Which of the following Databricks features are not part of open-source Spark?
- MLflow
- Databricks Runtime (CORRECT)
- Databricks Workflows (CORRECT)
- Databricks Workspace (CORRECT)
Correct: Databricks Runtime is not open-source Spark.
Correct: Databricks Workflows is not open-source Spark.
Correct: Databricks Workspace is not open-source Spark.
5. How many drivers does a Cluster have?
- Configurable between one and eight
- Two, running in parallel
- Only one (CORRECT)
Correct: A cluster has one and only one driver.
6. What type of process are the driver and the executors?
- Python processes
- Java processes (CORRECT)
- C++ processes
Correct: The driver and the executors are Java processes.
7. You work with Big Data as a data engineer, and you must process real-time data. This is referred to as having which of the following characteristics?
- High volume
- High velocity (CORRECT)
- Variety
Correct: This characteristic relates to the requirement for streaming and real-time processing capabilities.
8. Spark’s performance is based on parallelism. Which of the following scaling methods is limited by a finite amount of RAM, threads, and CPU speed?
- Diagonal Scaling
- Vertical Scaling (CORRECT)
- Horizontal Scaling
Correct: Scaling vertically is limited by a finite amount of RAM, threads, and CPU speed.
9. A Spark cluster uses two levels of parallelization. Which of the following are levels of parallelization?
- Partition
- Job
- Slot (CORRECT)
- Executor (CORRECT)
Correct: The second level of parallelization is the Slot – the number of which is determined by the number of cores and CPUs of each node.
Correct: The first level of parallelization is the Executor – a Java virtual machine running on a node, typically, one instance per node.
10. In an Apache Spark cluster, jobs are divided into which of the following?
- Tasks (CORRECT)
- Drivers
- Executors
- Slots
Correct: Jobs are subdivided into tasks. The input to a job is partitioned into one or more partitions. These partitions are the unit of work for each slot.
11. You are introducing the Databricks platform to your team. Which of the following features can you demonstrate to them? Check all that apply.
- Advanced query optimization (CORRECT)
- Caching and indexing (CORRECT)
- High-speed Azure connection (CORRECT)
- Automation of Spark clusters (CORRECT)
Correct: The focus of Azure Databricks is optimization. The platform was optimized from the ground up.
Correct: The platform increases performance and reduces costs by automating management.
Correct: Databricks provides high-speed connectors to Azure storage services, such as Azure Blob Store and Azure Data Lake.
Correct: Databricks offers auto-scaling and auto-termination of Spark clusters to minimize costs.
12. Identify the components of an Azure Databricks Spark cluster. Check all that apply.
- Slot (CORRECT)
- Driver (CORRECT)
- Task (CORRECT)
- Worker (CORRECT)
Correct: Work sent from the driver nodes to the worker nodes is assigned to slots, instructing them to pull data from a specified data source.
Correct: In an Apache Spark cluster, the driver is the notebook interface. It contains the main loop for the program and creates distributed datasets on the cluster.
Correct: A task is sent from the driver node to the worker nodes.
Correct: The nodes that host the executors and carry out the distributed work are known as workers.
13. You have deployed a series of jobs within a Spark cluster. How will the Spark cluster assign jobs to the nodes?
- Vertically
- Horizontally (CORRECT)
Correct: Spark scales jobs horizontally through a process known as parallelism.
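As a loose analogy in plain Python (not Spark code), a process pool can stand in for worker nodes: the input is split into partitions, each worker processes one partition in parallel, and the driver combines the partial results. The partition counts and the per-partition work are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def process_partition(partition):
    # Hypothetical per-partition work: sum the values in the partition.
    return sum(partition)

if __name__ == "__main__":
    data = list(range(100))
    # Split the input into 4 partitions, one per simulated worker.
    partitions = [data[i::4] for i in range(4)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    # The "driver" combines the partial results from each worker.
    print(sum(partial_results))  # 4950, the same as summing on one machine
```

Adding more workers (horizontal scaling) adds capacity without being bound by the RAM or CPU of any single machine.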
CONCLUSION – Introduction To Azure Databricks
In conclusion, this module provides a thorough exploration of Azure Databricks and the Apache Spark notebook, highlighting their capabilities in processing large files. By understanding the Azure Databricks platform, identifying tasks suited for Apache Spark, and learning about the architecture of a Spark Cluster and Spark Jobs, you will be well-equipped to leverage these technologies for efficient and effective data processing.
Quiztudy Top Courses
Popular in Coursera
- Google Advanced Data Analytics
- Google Cybersecurity Professional Certificate
- Meta Marketing Analytics Professional Certificate
- Google Digital Marketing & E-commerce Professional Certificate
- Google UX Design Professional Certificate
- Meta Social Media Marketing Professional Certificate
- Google Project Management Professional Certificate
- Meta Front-End Developer Professional Certificate
Liking our content? Then, don’t forget to add us to your BOOKMARKS so you can find us easily!

