COURSE 4: PERFORM DATA SCIENCE WITH AZURE DATABRICKS

Module 2: Working With Data In Azure Databricks

MICROSOFT AZURE DATA SCIENTIST ASSOCIATE (DP-100) PROFESSIONAL CERTIFICATE

Complete Coursera Study Guide

Last updated:

INTRODUCTION – Working With Data In Azure Databricks

Azure Databricks facilitates daily data-handling tasks like reading, writing, and querying data. In this module, you’ll manage large datasets from various sources and raw formats. You’ll learn to use the DataFrame Column Class in Azure Databricks to apply column-level transformations, including sorting, filtering, and aggregating data. Additionally, you’ll explore advanced DataFrame functions to manipulate data, perform aggregations, and handle date and time operations within Azure Databricks.

Learning Objectives

  • Describe how Azure Databricks supports day-to-day data-handling functions, such as reads, writes, and queries
  • Use the DataFrame Column Class in Azure Databricks to apply column-level transformations, such as sorts, filters, and aggregations
  • Use advanced DataFrame functions to manipulate data, apply aggregates, and perform date and time operations in Azure Databricks (illustrated in the sketch below)
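
As a quick orientation, here is a minimal PySpark sketch of the kinds of column-level and date/time operations this module covers. It assumes a hypothetical pageviewsDF DataFrame with timestamp (string), site, and requests columns; it is an illustration, not part of the graded material.

  from pyspark.sql import functions as F

  # Hypothetical DataFrame with columns: timestamp (string), site, requests
  transformedDF = (pageviewsDF
      .withColumn("timestamp", F.col("timestamp").cast("timestamp"))  # date/time conversion
      .filter(F.col("site") == "mobile")                               # column-level filter
      .groupBy(F.dayofweek("timestamp").alias("day_of_week"))          # date function + grouping
      .agg(F.sum("requests").alias("total_requests"))                  # aggregation
      .orderBy("day_of_week"))                                         # sort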

PRACTICE QUIZ: KNOWLEDGE CHECK 1

1. How do you list files in DBFS within a notebook?

  • %fs dir /my-file-path
  • ls /my-file-path
  • %fs ls /my-file-path (CORRECT)

Correct: You added the file system magic to the cell before executing the ls command.
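
As a minimal sketch (the path /my-file-path is just a placeholder from the question), the file-system magic looks like this in a notebook cell:

  %fs ls /my-file-path

The Python equivalent is dbutils.fs.ls("/my-file-path"), which returns a list of FileInfo objects that can be passed to display().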

2. How do you infer the data types and column names when you read a JSON file?

  • spark.read.inferSchema(“true”).json(jsonFile)
  • spark.read.option(“inferData”, “true”).json(jsonFile)
  • spark.read.option(“inferSchema”, “true”).json(jsonFile) (CORRECT)

Correct: This approach is the correct way to infer the file’s schema.
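
A minimal sketch, assuming a hypothetical jsonFile path in DBFS; the inferSchema option asks Spark to scan the data and derive column names and types:

  jsonFile = "dbfs:/mnt/training/sample/events.json"    # hypothetical path
  eventsDF = spark.read.option("inferSchema", "true").json(jsonFile)
  eventsDF.printSchema()                                # shows the inferred column names and types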

3. Which of the following SparkSession functions returns a DataFrameReader?

  • createDataFrame(..)
  • emptyDataFrame(..)
  • read(..) (CORRECT)
  • readStream(..)

Correct: The function SparkSession.read() returns a DataFrameReader.
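
To see why read(..) is the right answer: spark.read returns a DataFrameReader, on which you chain options and a format method. A sketch with a hypothetical CSV path:

  reader = spark.read                                   # DataFrameReader
  csvDF = reader.option("header", "true").csv("dbfs:/mnt/training/sample.csv")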

4. When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first couple thousand characters of a file?

  • %fs dir /mnt/training/wikipedia/pageviews/
  • %fs ls /mnt/training/wikipedia/pageviews/
  • %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)

Correct: We can use %fs head … to view the first couple thousand characters of a file.
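
As a sketch, the head command previews the start of the file:

  %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv

From Python, dbutils.fs.head("dbfs:/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv", 1000) returns roughly the first thousand characters as a string; the second argument (maximum bytes) is optional.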

PRACTICE QUIZ: KNOWLEDGE CHECK 2

1. Which of the following SparkSession functions returns a DataFrameReader?

  • createDataFrame(..)
  • emptyDataFrame(..)
  • read(..) (CORRECT)
  • readStream(..)

Correct: The function SparkSession.read() returns a DataFrameReader.

2. When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first couple thousand characters of a file?

  • %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)
  • %fs ls /mnt/training/wikipedia/pageviews/
  • %fs dir /mnt/training/wikipedia/pageviews/

Correct: We can use %fs head … to view the first couple thousand characters of a file.

3. Which DataFrame method do you use to create a temporary view?

  • createOrReplaceTempView() (CORRECT)
  • createTempViewDF()
  • createTempView()

Correct: You use this method to create temporary views in DataFrames.
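
A minimal sketch, assuming a hypothetical pageviewsDF DataFrame; the temporary view is session-scoped and can then be queried with SQL:

  pageviewsDF.createOrReplaceTempView("pageviews")      # session-scoped temporary view
  spark.sql("SELECT site, SUM(requests) AS total FROM pageviews GROUP BY site").show()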

4. How do you define a DataFrame object?

  • Use the DF.create() syntax
  • Introduce a variable name and equate it to something like myDataFrameDF = (CORRECT)
  • Use the createDataFrame() function

Correct: This approach is the correct way to create DataFrame objects.
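
In other words, a DataFrame is simply assigned to a variable; a sketch with a hypothetical CSV path:

  myDataFrameDF = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/mnt/training/sample.csv"))            # hypothetical path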

5. How do you cache data into the memory of the local executor for instant access?

  • .inMemory().save()
  • .cache() (CORRECT)
  • .save().inMemory()

Correct: The cache() method is an alias for persist(). Calling this moves data into the memory of the local executor. 
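
A short sketch, assuming a hypothetical pageviewsDF; cache() only marks the data to be persisted, and an action is needed to materialize it:

  pageviewsDF.cache()     # alias for persist(); marks the DataFrame for in-memory storage
  pageviewsDF.count()     # an action triggers the job that actually fills the cache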

6. What is the Python syntax for defining a DataFrame in Spark from an existing Parquet file in DBFS?

  • IPGeocodeDF = parquet.read(“dbfs:/mnt/training/ip-geocode.parquet”)
  • IPGeocodeDF = spark.parquet.read(“dbfs:/mnt/training/ip-geocode.parquet”)
  • IPGeocodeDF = spark.read.parquet(“dbfs:/mnt/training/ip-geocode.parquet”) (CORRECT)

Correct: This syntax is correct.
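
A sketch using the path from the question; because Parquet files carry their own schema, no schema inference is needed:

  IPGeocodeDF = spark.read.parquet("dbfs:/mnt/training/ip-geocode.parquet")
  IPGeocodeDF.printSchema()    # schema is read from the Parquet metadata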

QUIZ: TEST PREP

1. How do you list files in DBFS within a notebook?

  • ls /my-file-path
  • %fs dir /my-file-path
  • %fs ls /my-file-path (CORRECT)

Correct: You added the file system magic to the cell before executing the ls command.

2. How do you infer the data types and column names when you read a JSON file?

  • spark.read.option(“inferSchema”, “true”).json(jsonFile) (CORRECT)
  • spark.read.inferSchema(“true”).json(jsonFile)
  • spark.read.option(“inferData”, “true”).json(jsonFile)

Correct: This approach is the correct way to infer the file’s schema.

3. Which of the following SparkSession functions returns a DataFrameReader?

  • readStream(..)
  • createDataFrame(..)
  • emptyDataFrame(..)
  • read(..) (CORRECT)

Correct: The function SparkSession.read() returns a DataFrameReader.

4. When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first couple thousand characters of a file?

  • %fs ls /mnt/training/wikipedia/pageviews/
  • %fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)
  • %fs dir /mnt/training/wikipedia/pageviews/

Correct: We can use %fs head … to view the first couple thousand characters of a file.

5. You have created an Azure Databricks cluster, and you have access to a source file.

fileName = “dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv”

You need to determine the structure of the file. Which of the following commands will assist with determining what the column names and data types are?

  • .option(“inferSchema”, “true”) (CORRECT)
  • .option(“header”, “true”)
  • .option(“inferSchema”, “false”)
  • .option(“header”, “false”)

Correct: Using .option(“inferSchema”, “true”), Spark will automatically go through the file and infer the schema of each column.
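
Putting the options together for this file, a sketch might look like the following (the sep and header options are assumptions that match a tab-separated file with a header row):

  fileName = "dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv"
  clickstreamDF = (spark.read
      .option("sep", "\t")               # tab-separated values
      .option("header", "true")          # first row holds the column names
      .option("inferSchema", "true")     # Spark scans the file to derive data types
      .csv(fileName))
  clickstreamDF.printSchema()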

6. In an Azure Databricks workspace you run the following command:   

%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv

The partial output from this command is as follows:

[Truncated to first 65536 bytes]

“timestamp”   “site”   “requests”
“2015-03-16T00:09:55”   “mobile”   1595
“2015-03-16T00:10:39”   “mobile”   1544
“2015-03-16T00:19:39”   “desktop”   2460
“2015-03-16T00:38:11”   “desktop”   2237
“2015-03-16T00:42:40”   “mobile”   1656
“2015-03-16T00:52:24”   “desktop”   2452

Which of the following pieces of information can be inferred from the command and the output? 

Select all that apply.

  • The file is a comma-separated or CSV file
  • All columns are strings
  • The file has no header
  • Two columns are strings, and one column is a number (CORRECT)
  • The columns are tab-separated (CORRECT)
  • The file has a header (CORRECT)

Correct: The strings are enclosed in double quotes, while the number column is not.

Correct: The file is tab-separated. This can be inferred from the file extension and the lack of other characters between each “column”.

Correct: The first line of the output displays the column names.

7. In Azure Databricks you wish to create a temporary view that will be accessible to multiple notebooks. Which of the following commands will provide this feature?

  • createOrReplaceTempView(set_scope “Global”)
  • createOrReplaceTempView(..)
  • createOrReplaceGlobalTempView(..) (CORRECT)

Correct: The Spark method createOrReplaceGlobalTempView(..) is bound to the Spark application, allowing the view to be read in this notebook and from another.
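
A minimal sketch, assuming a hypothetical pageviewsDF; global temporary views are registered in the global_temp database and remain visible to other notebooks attached to the same cluster:

  pageviewsDF.createOrReplaceGlobalTempView("pageviews_global")
  # From this notebook or another one attached to the same cluster:
  spark.sql("SELECT * FROM global_temp.pageviews_global LIMIT 10").show()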

8. Which of the following are true with respect to Parquet files?

Select all that apply.

  • Designed for performance on small data sets
  • Is a Row-Oriented data store
  • Is a splittable “file format”. (CORRECT)
  • Efficient data compression (CORRECT)
  • Is a Column-Oriented data store (CORRECT)
  • Open Source (CORRECT)

Correct: Parquet files are splittable.

Correct: Parquet files provide efficient data compression.

Correct: Parquet files are Column-Oriented.

Correct: Parquet is a free, open-source format.
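
To make those properties concrete, a short sketch that writes a DataFrame out as Parquet (clickstreamDF and the output path are assumptions for illustration):

  (clickstreamDF.write
      .mode("overwrite")
      .parquet("dbfs:/mnt/training/output/clickstream.parquet"))   # columnar, compressed, splittable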

CONCLUSION – Working With Data In Azure Databricks

In conclusion, Azure Databricks is a robust platform for everyday data-handling tasks, such as reading, writing, and querying. This module equips you to manage large datasets from diverse sources and raw formats, apply column-level transformations using the DataFrame Column Class, and utilize advanced DataFrame functions for data manipulation, aggregation, and date and time operations. By mastering these skills, you will be well-prepared to effectively handle complex data workflows in Azure Databricks.