COURSE 4: PERFORM DATA SCIENCE WITH AZURE DATABRICKS

Module 2: Working With Data In Azure Databricks

MICROSOFT AZURE DATA SCIENTIST ASSOCIATE (DP-100) PROFESSIONAL CERTIFICATE

Complete Coursera Study Guide

Last updated:

INTRODUCTION – Working With Data In Azure Databricks

Azure Databricks facilitates daily data-handling tasks like reading, writing, and querying data. In this module, you’ll manage large datasets from various sources and raw formats. You’ll learn to use the DataFrame Column Class in Azure Databricks to apply column-level transformations, including sorting, filtering, and aggregating data. Additionally, you’ll explore advanced DataFrame functions to manipulate data, perform aggregations, and handle date and time operations within Azure Databricks.

Learning Objectives

  • Describe how Azure Databricks supports day-to-day data-handling functions, such as reads, writes, and queries
  • Use the DataFrame Column Class in Azure Databricks to apply column-level transformations, such as sorts, filters, and aggregations
  • Use advanced DataFrame functions to manipulate data, apply aggregates, and perform date and time operations in Azure Databricks (illustrated in the sketch below)
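
As a quick orientation, here is a minimal PySpark sketch of the kinds of column-level and date/time operations this module covers. It assumes a hypothetical pageviewsDF DataFrame with timestamp (string), site, and requests columns; it is an illustration, not part of the graded material.

  from pyspark.sql import functions as F

  # Hypothetical DataFrame with columns: timestamp (string), site, requests
  transformedDF = (pageviewsDF
      .withColumn("timestamp", F.col("timestamp").cast("timestamp"))  # date/time conversion
      .filter(F.col("site") == "mobile")                               # column-level filter
      .groupBy(F.dayofweek("timestamp").alias("day_of_week"))          # date function + grouping
      .agg(F.sum("requests").alias("total_requests"))                  # aggregation
      .orderBy("day_of_week"))                                         # sort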

PRACTICE QUIZ: KNOWLEDGE CHECK 1

1. How do you list files in DBFS within a notebook?

  • %fs dir /my-file-path
  • ls /my-file-path
  • %fs ls /my-file-path (CORRECT)

Correct: You added the file system magic to the cell before executing the ls command.
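
As a minimal sketch (the path /my-file-path is just a placeholder from the question), the file-system magic looks like this in a notebook cell:

  %fs ls /my-file-path

The Python equivalent is dbutils.fs.ls("/my-file-path"), which returns a list of FileInfo objects that can be passed to display().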

2. How do you infer the data types and column names when you read a JSON file?

  • spark.read.inferSchema(“true”).json(jsonFile)
  • spark.read.option(“inferData”, “true”).json(jsonFile)
  • spark.read.option(“inferSchema”, “true”).json(jsonFile) (CORRECT)

Correct: This approach is the correct way to infer the file’s schema.
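
A minimal sketch, assuming a hypothetical jsonFile path in DBFS; the inferSchema option asks Spark to scan the data and derive column names and types:

  jsonFile = "dbfs:/mnt/training/sample/events.json"    # hypothetical path
  eventsDF = spark.read.option("inferSchema", "true").json(jsonFile)
  eventsDF.printSchema()                                # shows the inferred column names and types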

3. Which of the following SparkSession functions returns a DataFrameReader?

  • createDataFrame(..)
  • emptyDataFrame(..)
  • read(..) (CORRECT)
  • readStream(..)

Correct: The function SparkSession.read() returns a DataFrameReader.
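
To see why read(..) is the right answer: spark.read returns a DataFrameReader, on which you chain options and a format method. A sketch with a hypothetical CSV path:

  reader = spark.read                                   # DataFrameReader
  csvDF = reader.option("header", "true").csv("dbfs:/mnt/training/sample.csv")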

4. When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first couple thousand characters of a file?

  • %fs dir /mnt/training/wikipedia/pageviews/
  • %fs ls /mnt/training/wikipedia/pageviews/
  • %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)

Correct: We can use %fs head … to view the first couple thousand characters of a file.
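
As a sketch, the head command previews the start of the file:

  %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv

From Python, dbutils.fs.head("dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv", 1000) returns roughly the first thousand characters as a string; the second argument (maximum bytes) is optional.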

PRACTICE QUIZ: KNOWLEDGE CHECK 2

1. Which of the following SparkSession functions returns a DataFrameReader?

  • createDataFrame(..)
  • emptyDataFrame(..)
  • read(..) (CORRECT)
  • readStream(..)

Correct: The function SparkSession.read() returns a DataFrameReader.

2. When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first couple thousand characters of a file?

  • %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)
  • %fs ls /mnt/training/wikipedia/pageviews/
  • %fs dir /mnt/training/wikipedia/pageviews/

Correct: We can use %fs head … to view the first couple thousand characters of a file.

3. Which DataFrame method do you use to create a temporary view?

  • createOrReplaceTempView() (CORRECT)
  • createTempViewDF()
  • createTempView()

Correct: You use this method to create temporary views in DataFrames.
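
A minimal sketch, assuming a hypothetical pageviewsDF DataFrame; the temporary view is session-scoped and can then be queried with SQL:

  pageviewsDF.createOrReplaceTempView("pageviews")      # session-scoped temporary view
  spark.sql("SELECT site, SUM(requests) AS total FROM pageviews GROUP BY site").show()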

4. How do you define a DataFrame object?

  • Use the DF.create() syntax
  • Introduce a variable name and equate it to something like myDataFrameDF = (CORRECT)
  • Use the createDataFrame() function

Correct: This approach is the correct way to create DataFrame objects.
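
In other words, a DataFrame is simply assigned to a variable; a sketch with a hypothetical CSV path:

  myDataFrameDF = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/mnt/training/sample.csv"))            # hypothetical path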

5. How do you cache data into the memory of the local executor for instant access?

  • .inMemory().save()
  • .cache() (CORRECT)
  • .save().inMemory()

Correct: The cache() method is an alias for persist(). Calling this moves data into the memory of the local executor. 
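
A short sketch, assuming a hypothetical pageviewsDF; cache() only marks the data to be persisted, and an action is needed to materialize it:

  pageviewsDF.cache()     # alias for persist(); marks the DataFrame for in-memory storage
  pageviewsDF.count()     # an action triggers the job that actually fills the cache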

6. What is the Python syntax for defining a DataFrame in Spark from an existing Parquet file in DBFS?

  • IPGeocodeDF = parquet.read(“dbfs:/mnt/training/ip-geocode.parquet”)
  • IPGeocodeDF = spark.parquet.read(“dbfs:/mnt/training/ip-geocode.parquet”)
  • IPGeocodeDF = spark.read.parquet(“dbfs:/mnt/training/ip-geocode.parquet”) (CORRECT)

Correct: This syntax is correct.
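
A sketch using the path from the question; because Parquet files carry their own schema, no schema inference is needed:

  IPGeocodeDF = spark.read.parquet("dbfs:/mnt/training/ip-geocode.parquet")
  IPGeocodeDF.printSchema()    # schema is read from the Parquet metadata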

QUIZ: TEST PREP

1. How do you list files in DBFS within a notebook?

  • ls /my-file-path
  • %fs dir /my-file-path
  • %fs ls /my-file-path (CORRECT)

Correct: You added the file system magic to the cell before executing the ls command.

2. How do you infer the data types and column names when you read a JSON file?

  • spark.read.option(“inferSchema”, “true”).json(jsonFile) (CORRECT)
  • spark.read.inferSchema(“true”).json(jsonFile)
  • spark.read.option(“inferData”, “true”).json(jsonFile)

Correct: This approach is the correct way to infer the file’s schema.

3. Which of the following SparkSession functions returns a DataFrameReader?

  • readStream(..)
  • createDataFrame(..)
  • emptyDataFrame(..)
  • read(..) (CORRECT)

Correct: The function SparkSession.read() returns a DataFrameReader.

4. When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first couple thousand characters of a file?

  • %fs ls /mnt/training/wikipedia/pageviews/
  • %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)
  • %fs dir /mnt/training/wikipedia/pageviews/

Correct: We can use %fs head … to view the first couple thousand characters of a file.

5. You have created an Azure Databricks cluster, and you have access to a source file.

fileName = “dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv”

You need to determine the structure of the file. Which of the following commands will assist with determining what the column names and data types are?

  • .option(“inferSchema”, “true”) (CORRECT)
  • .option(“header”, “true”)
  • .option(“inferSchema”, “false”)
  • .option(“header”, “false”)

Correct: Using .option(“inferSchema”, “true”), Spark will automatically go through the file and infer the schema of each column.
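
Putting the options together for this file, a sketch might look like the following (the sep and header options are assumptions that match a tab-separated file with a header row):

  fileName = "dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv"
  clickstreamDF = (spark.read
      .option("sep", "\t")               # tab-separated values
      .option("header", "true")          # first row holds the column names
      .option("inferSchema", "true")     # Spark scans the file to derive data types
      .csv(fileName))
  clickstreamDF.printSchema()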

6. In an Azure Databricks workspace you run the following command:   

%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv

The partial output from this command is as follows:

[Truncated to first 65536 bytes]

“timestamp”   “site”   “requests”
“2015-03-16T00:09:55”   “mobile”   1595
“2015-03-16T00:10:39”   “mobile”   1544
“2015-03-16T00:19:39”   “desktop”   2460
“2015-03-16T00:38:11”   “desktop”   2237
“2015-03-16T00:42:40”   “mobile”   1656
“2015-03-16T00:52:24”   “desktop”   2452

Which of the following pieces of information can be inferred from the command and the output? 

Select all that apply.

  • The file is a comma-separated or CSV file
  • All columns are strings
  • The file has no header
  • Two columns are strings, and one column is a number (CORRECT)
  • The columns are tab-separated (CORRECT)
  • The file has a header (CORRECT)

Correct: The strings are enclosed in double quotes, while the number column is not.

Correct: The file is tab-separated. This can be inferred from the file extension and the lack of other characters between each “column”.

Correct: The first line of the output displays the column names.

7. In Azure Databricks you wish to create a temporary view that will be accessible to multiple notebooks. Which of the following commands will provide this feature?

  • createOrReplaceTempView(set_scope “Global”)
  • createOrReplaceTempView(..)
  • createOrReplaceGlobalTempView(..) (CORRECT)

Correct: The Spark method createOrReplaceGlobalTempView(..) is bound to the Spark application, allowing the view to be read in this notebook and from another.
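
A minimal sketch, assuming a hypothetical pageviewsDF; global temporary views are registered in the global_temp database and remain visible to other notebooks attached to the same cluster:

  pageviewsDF.createOrReplaceGlobalTempView("pageviews_global")
  # From this notebook or another one attached to the same cluster:
  spark.sql("SELECT * FROM global_temp.pageviews_global LIMIT 10").show()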

8. Which of the following are true with respect to Parquet files?

Select all that apply.

  • Designed for performance on small data sets
  • Is a Row-Oriented data store
  • Is a splittable “file format”. (CORRECT)
  • Efficient data compression (CORRECT)
  • Is a Column-Oriented data store (CORRECT)
  • Open Source (CORRECT)

Correct: Parquet files are splittable.

Correct: Parquet files provide efficient data compression.

Correct: Parquet files are Column-Oriented.

Correct: Parquet is a free, open-source format.
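
To make those properties concrete, a short sketch that writes a DataFrame out as Parquet (clickstreamDF and the output path are assumptions for illustration):

  (clickstreamDF.write
      .mode("overwrite")
      .parquet("dbfs:/mnt/training/output/clickstream.parquet"))   # columnar, compressed, splittable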

CONCLUSION – Working With Data In Azure Databricks

In conclusion, Azure Databricks is a robust platform for everyday data-handling tasks, such as reading, writing, and querying. This module equips you to manage large datasets from diverse sources and raw formats, apply column-level transformations using the DataFrame Column Class, and utilize advanced DataFrame functions for data manipulation, aggregation, and date and time operations. By mastering these skills, you will be well-prepared to effectively handle complex data workflows in Azure Databricks.