COURSE 4: PERFORM DATA SCIENCE WITH AZURE DATABRICKS
Module 3: Processing Data In Azure Databricks
MICROSOFT AZURE DATA SCIENTIST ASSOCIATE (DP-100) PROFESSIONAL CERTIFICATE
Complete Coursera Study Guide
Last updated:
INTRODUCTION – Processing Data In Azure Databricks
Azure Databricks offers a variety of built-in SQL functions, but there are times when you’ll need to create custom functions, known as User-Defined Functions (UDFs). In this module, you will learn how to register and invoke UDFs, allowing you to extend the capabilities of your SQL queries.
Additionally, you will explore how to use Delta Lake for creating, appending, and upserting data into Apache Spark tables, leveraging its built-in reliability and optimization features to enhance your data management processes.
Learning Objectives
- Write User-Defined Functions
- Perform ETL operations using User-Defined Functions
- Learn about the key features and use cases of Delta Lake.
- Use Delta Lake to create, append, and upsert tables.
- Perform optimizations in Delta Lake.
- Compare different versions of a Delta table using Time Travel.
PRACTICE QUIZ: KNOWLEDGE CHECK 1
1. Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments. This functionality is referred to as?
- Time Travel (CORRECT)
- Schema Evolution
- Schema Enforcement
- ACID Transactions
Correct: Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
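For reference, a minimal sketch of querying earlier versions of a Delta table with Time Travel; the path, table name, version number, and timestamp below are placeholders, and `spark` is the SparkSession that Databricks notebooks provide automatically.

```python
# Minimal Time Travel sketch (placeholder path/table/version/timestamp).
delta_path = "/mnt/delta/events"  # hypothetical location of a Delta table

# Read an earlier snapshot by version number...
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# ...or by timestamp
df_old = (spark.read.format("delta")
          .option("timestampAsOf", "2024-01-01")
          .load(delta_path))

# Equivalent SQL against a registered table named `events` (Databricks syntax)
spark.sql("SELECT * FROM events VERSION AS OF 0")
```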
2. One of the core features of Delta Lake is performing upserts. Which of the following statements is true in regard to Upsert?
- UpSert is literally TWO operations. Update / Insert (CORRECT)
- Upsert is a new DML statement for SQL syntax
- Upsert is supported in traditional data lakes
Correct: To UPSERT means to “UPdate” and “inSERT”. In other words, UPSERT is literally TWO operations. It is not supported in traditional data lakes.
3. When discussing Delta Lake, there is often a reference to the concept of Bronze, Silver and Gold tables. These levels refer to the state of data refinement as data flows through a processing pipeline and are conceptual guidelines. Based on these table concepts the refinements in Silver tables generally relate to which of the following?
- Highly refined views of the data
- Data that is directly queryable and ready for insights (CORRECT)
- Raw data (or very little processing)
Correct: Silver tables generally relate to data that is directly queryable and ready for insights.
4. What is the Databricks Delta command to display metadata?
- DESCRIBE DETAIL tableName (CORRECT)
- MSCK DETAIL tablename
- SHOW SCHEMA tablename
Correct: You display metadata by using DESCRIBE DETAIL tableName.
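A short sketch of how this looks in a Databricks notebook; `tableName` is a placeholder and `display` is the notebook's built-in rendering helper.

```python
# Show table metadata (format, location, size, partition columns, etc.)
display(spark.sql("DESCRIBE DETAIL tableName"))

# DESCRIBE HISTORY shows the transaction log, which pairs well with Time Travel
display(spark.sql("DESCRIBE HISTORY tableName"))
```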
5. How do you perform UPSERT in a Delta dataset?
- Use UPSERT INTO my-table /MERGE
- Use UPSERT INTO my-table
- Use MERGE INTO my-table USING data-to-upsert (CORRECT)
Correct: That’s the correct syntax to perform UPSERT in a Databricks Delta dataset.
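A hedged sketch of the full MERGE statement behind an upsert, using the placeholder names from the question (`my-table`, `data-to-upsert`) and an assumed join key `id`.

```python
# UPSERT = update matched rows and insert unmatched rows in one MERGE statement.
spark.sql("""
  MERGE INTO `my-table` AS target
  USING `data-to-upsert` AS source
  ON target.id = source.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```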
6. What optimization does the following command perform: OPTIMIZE Students ZORDER BY Grade?
- Ensures that all data backing, for example, Grade=8 is colocated, then rewrites the sorted data into new Parquet files. (CORRECT)
- Ensures that all data backing, for example, Grade=8 is colocated, then updates a graph that routes requests to the appropriate files.
- Creates an order-based index on the Grade field to improve filters against that field.
Correct: ZOrdering colocates related information in the same set of files.
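For reference, a minimal sketch of running the compaction and Z-ordering described above from a notebook; the Students table is the placeholder from the question.

```python
# Compact small files and colocate rows with similar Grade values in the same
# Parquet files, so filters on Grade can skip unrelated files.
spark.sql("OPTIMIZE Students ZORDER BY (Grade)")
```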
PRACTICE QUIZ: KNOWLEDGE CHECK 2
1. You have a dataframe which you preprocessed and filtered down to only the relevant columns.
- The columns are: id, host_name, bedrooms, neighbourhood_cleansed, price.
- You want to retrieve the first initial from the host_name field.
- How would you write that function in local Python/Scala?
- def firstInitialFunction(name):
      get name[0]
  firstInitialFunction("Steven")
- new firstInitialFunction(name):
      return name[1]
  firstInitialFunction("Steven")
- def firstInitialFunction(name): (CORRECT)
      return name[0]
  firstInitialFunction("Steven")
- new firstInitialFunction(name):
      extract name[]
  firstInitialFunction("Steven")
Correct: This is the correct code that will get the first initial from the host_name column.
2. You’ve written a function that retrieves the first initial letter from the host_name column.
You now want to define it as a user-defined function named firstInitialUDF.
How would you define that using Python/Scala?
- firstInitialUDF = udf(firstInitialFunction) (CORRECT)
- firstInitialUDF = firstInitialFunction()
- firstInitial = udf(firstInitial)
- firstInitial = udf(firstInitialFunction)
Correct: This is the correct code that will create your UDF.
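Putting questions 1 and 2 together, a minimal runnable sketch, assuming a Databricks notebook where `spark` is available and a hypothetical DataFrame `df` with a host_name column:

```python
from pyspark.sql.functions import udf

# Plain local Python function: returns the first initial of a name
def firstInitialFunction(name):
    return name[0]

firstInitialFunction("Steven")  # returns 'S'

# Wrap the local function as a Spark UDF so it can run on DataFrame columns
firstInitialUDF = udf(firstInitialFunction)

# Hypothetical usage on a DataFrame with a host_name column:
# df.select(firstInitialUDF("host_name")).show()
```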
3. If you want to create the UDF in the SQL namespace, what method do you need to use?
- spark.sql.create
- spark.udf.register (CORRECT)
- spark.sql.read
- spark.sql.register
Correct: spark.udf.register creates the UDF in the SQL namespace.
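A small sketch of registering the same function for use in SQL; the registered name `sql_udf` and the table `airbnb_listings` are placeholders.

```python
# Register the Python function under a name that SQL queries can call
spark.udf.register("sql_udf", firstInitialFunction)

# Hypothetical SQL usage against a registered table
spark.sql("SELECT sql_udf(host_name) AS firstInitial FROM airbnb_listings")
```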
4. Which is another syntax that you can use to define a UDF in Python?
- Designer
- Decorator (CORRECT)
- Capsulator
- Wrapper
Correct: Alternatively, you can define a UDF using decorator syntax in Python with the dataType the function will return. However, you cannot call the local Python function anymore, e.g. decoratorUDF(“Jane”) will not work.
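A minimal sketch of the decorator form, declaring the return dataType up front; `df` is a hypothetical DataFrame.

```python
from pyspark.sql.functions import udf

@udf("string")  # the dataType the function will return
def decoratorUDF(name):
    return name[0]

# decoratorUDF is now a UDF, not a local Python function, so
# decoratorUDF("Jane") can no longer be called directly.
# df.select(decoratorUDF("host_name")).show()  # hypothetical usage
```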
5. True or false?
The Catalyst Optimizer can be used to optimize UDFs.
- True
- False (CORRECT)
Correct: UDFs cannot be optimized by the Catalyst Optimizer.
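Because Python UDFs are opaque to Catalyst, a common alternative is to express the same logic with built-in column functions, which the optimizer can reason about. A hedged sketch using substring, assuming a DataFrame with a host_name column:

```python
from pyspark.sql.functions import col, substring

# Built-in functions compile to Catalyst expressions and can be optimized;
# a Python UDF is a black box the optimizer cannot rewrite.
def first_initial_builtin(df):
    return df.select(substring(col("host_name"), 1, 1).alias("firstInitial"))
```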
QUIZ: TEST PREP
1. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for DDL modifications. This functionality is referred to as?
- Schema Evolution (CORRECT)
- ACID Transactions
- Time Travel
- Schema Enforcement
Correct: Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for DDL modifications.
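A minimal sketch of schema evolution on write: appending a DataFrame that carries an extra column, with mergeSchema enabled so no ALTER TABLE / DDL change is needed. `new_df` and `delta_path` are placeholders.

```python
# Append data whose schema adds a column; Delta evolves the table schema
# automatically when mergeSchema is enabled.
(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(delta_path))
```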
2. One of the core features of Delta Lake is performing upserts. Which of the following statements is true regarding Upsert?
- Upsert is supported in traditional data lakes
- Upsert is literally TWO operations. Update / Insert (CORRECT)
- Upsert is a new DML statement for SQL syntax
Correct: To UPSERT means to “UPdate” and “inSERT”. In other words, UPSERT is literally TWO operations. It is not supported in traditional data lakes.
3. What is the Databricks Delta command to display metadata?
- SHOW SCHEMA tablename
- MSCK DETAIL tablename
- DESCRIBE DETAIL tableName (CORRECT)
Correct: You display metadata by using DESCRIBE DETAIL tableName.
4. What optimization does the following command perform: OPTIMIZE Customers ZORDER BY City?
- Ensures that all data backing, for example, City=”London” is colocated, then updates a graph that routes requests to the appropriate files.
- Ensures that all data backing, for example, City=’London’ is colocated, then rewrites the sorted data into new Parquet files. (CORRECT)
- Creates an order-based index on the City field to improve filters against that field
Correct: ZOrdering colocates related information in the same set of files.
5. You are planning on registering a user-defined function, g, as g_function in a SQL namespace. How would you achieve this programmatically?
- spark.register_udf(“g_function”, g)
- spark.udf.register(“g_function”, g) (CORRECT)
- spark.udf.register(g, “g_function”)
Correct: This is the correct syntax to register the UDF in the SQL namespace.
6. True or False?
User-defined Functions cannot operate on DataFrames.
- False (CORRECT)
- True
Correct: UDFs can operate on DataFrames, so the statement is false.
7. Suppose you already have a dataframe which only contains relevant columns.
The columns are: id, employee_name, age, gender.
You want to retrieve the first initial from the employee_name field by creating a local function in Python/Scala. Which of the following code snippets can be used to get the first initial from the employee_name column?
- new firstInitialFunction(name):
      extract name[]
  firstInitialFunction("Steven")
- def firstInitialFunction(name):
      get name[0]
  firstInitialFunction("Steven")
- def firstInitialFunction(name): (CORRECT)
      return name[0]
  firstInitialFunction("Steven")
Correct: This is the correct code that will get the first initial from the employee_name column.
8. In Delta Lake you need to merge, update and delete datasets while ensuring you comply with GDPR and CCPA. What supported APIs can you use?
Select all that apply.
- XML
- Scala (CORRECT)
- Java (CORRECT)
- Python (CORRECT)
- SQL (CORRECT)
Correct: Delta Lake supports Scala, Java, Python, and SQL APIs to merge, update, and delete datasets, allowing you to easily comply with GDPR and CCPA.
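As a hedged illustration of the Python API, a sketch of a GDPR/CCPA-style point delete and correction; the path, predicate, and replacement value are placeholders.

```python
from delta.tables import DeltaTable

users = DeltaTable.forPath(spark, "/mnt/delta/users")  # hypothetical path

# Remove one user's records
users.delete("user_id = 'user-123'")

# Or correct a value in place (the set expressions are SQL strings)
users.update(
    condition="user_id = 'user-456'",
    set={"email": "'redacted@example.com'"}
)
```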
CONCLUSION – Processing Data In Azure Databricks
In conclusion, while Azure Databricks provides numerous built-in SQL functions, the ability to create and use User-Defined Functions (UDFs) offers additional flexibility and power. This module has equipped you with the knowledge to register and invoke UDFs, as well as utilize Delta Lake for creating, appending, and upserting data into Apache Spark tables. By leveraging these tools and techniques, you can optimize your data management and ensure reliable and efficient processing in Azure Databricks.
Quiztudy Top Courses
Popular in Coursera
- Google Advanced Data Analytics
- Google Cybersecurity Professional Certificate
- Meta Marketing Analytics Professional Certificate
- Google Digital Marketing & E-commerce Professional Certificate
- Google UX Design Professional Certificate
- Meta Social Media Marketing Professional Certificate
- Google Project Management Professional Certificate
- Meta Front-End Developer Professional Certificate
Liking our content? Then don’t forget to add us to your BOOKMARKS so you can find us easily!

