COURSE 3: GO BEYOND THE NUMBERS: TRANSLATE DATA INTO INSIGHTS

Module 3: Clean Your Data

GOOGLE ADVANCED DATA ANALYTICS PROFESSIONAL CERTIFICATE

Complete Coursera Study Guide

INTRODUCTION – Clean Your Data


In this module, you will explore three core Exploratory Data Analysis (EDA) practices: data cleaning, data joining, and data validation. Each practice helps you work reliably with diverse datasets. By the end of this module, you will understand the reasoning behind these practices and have hands-on experience applying them.

Data cleaning is a foundational step in the EDA process: it addresses inconsistencies and inaccuracies within datasets. Using Python, you will practice cleaning techniques for handling errors, outliers, and missing values. You will also see how a clean dataset supports more accurate and meaningful insights.

Learning Objectives

  • Apply input validation skills to a dataset with Python
  • Explain the importance of input validation
  • Demonstrate how to transform categorical data into numerical data with Python
  • Explain the importance of categorical versus numerical data in a dataset
  • Explain the importance of recognizing outliers in a dataset
  • Demonstrate how to identify outliers in a dataset with Python
  • Understand when to contact stakeholders or engineers regarding missing values
  • Explain the importance of ethically considering missing values
  • Demonstrate how to identify missing data with Python

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: THE CHALLENGE OF MISSING OR DUPLICATE DATA

1. Fill in the blank: Missing data has a value that is not stored for a _____ in a dataset.

  • visualization
  • column
  • variable (CORRECT)
  • row

Correct: Missing data has a value that is not stored for a variable in a dataset. It is typically encoded as N/A, NaN, or a blank.

2. A data professional requests additional information from a dataset’s original owner. Unfortunately, they are not able to provide the information. Therefore, the data professional creates a NaN category in the dataset. What concept does this scenario describe?

  • Solving the problem of missing data (CORRECT)
  • Mapping variables in a dataset
  • Managing big data
  • Ensuring two datasets are compatible

Correct: This scenario describes solving the problem of missing data. There are four common ways to do this: Request the missing values from the owner of the data; delete the missing columns, rows, or values; create a NaN category; or derive new representative values.
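The last three strategies listed above can be sketched in pandas. This is a minimal illustration rather than the course's own code; the DataFrame and its column names (city, temp) are hypothetical, and using the column mean is only one possible way to derive a representative value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "city": ["Austin", None, "Denver", "Boise"],
    "temp": [30.0, 25.0, np.nan, 20.0],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Strategy: delete the rows that contain any missing value.
df_dropped = df.dropna()

# Strategy: create an explicit "NaN" category for missing labels.
df_labeled = df.assign(city=df["city"].fillna("NaN"))

# Strategy: derive a representative value (here, the column mean).
df_imputed = df.assign(temp=df["temp"].fillna(df["temp"].mean()))
```

Each strategy returns a new DataFrame, so the original df stays available for comparison.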

3. When merging data, a data professional uses the following code:

df_joined = df.merge(df_zip, how='left',
    on=['date','center_point_geom'])

What is the function of the parameters how and on in this code?

  • To tell Python how to find missing values in the rows and columns
  • To tell Python how to place the appropriate values on the top row of the dataset
  • To tell Python which way to join the data and which column to join from (CORRECT)
  • To tell Python which datasets should be merged

Correct: The parameters how and on tell Python which way to join the data and which column to join from. The how parameter sets the type of join (here, a left join), and the on parameter names the column or columns used to match rows between the two dataframes.
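To see the effect of these two parameters, here is a small sketch of a left join on two key columns. Only the merge call pattern comes from the question; the rows, the strikes column, and the zip_code column are made up for illustration.

```python
import pandas as pd

# Hypothetical left table: one row per observation.
df = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-01", "2018-08-02"],
    "center_point_geom": ["POINT(1 1)", "POINT(2 2)", "POINT(1 1)"],
    "strikes": [10, 4, 7],
})

# Hypothetical right table: extra attributes for some of the key pairs.
df_zip = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-02"],
    "center_point_geom": ["POINT(1 1)", "POINT(1 1)"],
    "zip_code": ["73301", "73301"],
})

# how="left" keeps every row of df; on=[...] lists the key columns
# whose values must match in both frames.
df_joined = df.merge(df_zip, how="left", on=["date", "center_point_geom"])
```

Because this is a left join, the row of df with no match in df_zip is kept, and its zip_code is filled with a missing value.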

4. Non-null count is the total number of blank data entries within a data column.

  • True
  • False (CORRECT)

Correct: Non-null count is the total number of data entries for a data column that are not blank.
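A quick way to check non-null counts in pandas is sketched below, using a made-up two-column DataFrame. The count() method returns the figure directly, and df.info() prints the same figure in its "Non-Null Count" column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column has exactly one missing entry.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# Number of entries in each column that are not blank.
non_null = df.count()

# Number of entries in each column that are missing.
null = df.isna().sum()
```

For every column, the non-null count plus the null count equals the number of rows.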

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: THE INS AND OUTS OF DATA OUTLIERS

1. What type of outlier is a normal data point under certain conditions, but becomes an anomaly under most other conditions?

  • Global outlier
  • Collective outlier
  • Contextual outlier (CORRECT)
  • Constant outlier

Correct: A contextual outlier is a normal data point under certain conditions, but becomes an anomaly under most other conditions.

2. What is the term for a line of text that follows a method or function, which is used to explain the purpose of that method or function to others using the same code?

  • Factor
  • Annotation
  • Argument
  • Docstring (CORRECT)

Correct: A docstring is a line of text that follows a method or function, which is used to explain the purpose of that method or function to others using the same code.
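For reference, a docstring in Python looks like the sketch below. The function name and its purpose are invented for this example.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


# The docstring is stored on the function object and is what help() displays.
summary = fahrenheit_to_celsius.__doc__
```

Calling help(fahrenheit_to_celsius) in a notebook or console shows the same text to anyone using the function.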

3. A data professional is using a box plot to identify suspected high outliers in a dataset, according to the interquartile rule. To do that, they search for data points greater than the third quartile plus what standard of the interquartile range?

  • 3 times
  • .5 times
  • 1.5 times (CORRECT)
  • 10 times

Correct: They search for data points greater than the third quartile plus 1.5 times the interquartile range. Box plots can be used to identify outliers because they show which data points fall beyond this limit. Under the interquartile rule, any data point more than 1.5 times the interquartile range below the first quartile or above the third quartile is a suspected outlier.
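The interquartile rule can also be applied numerically, without a plot. The sketch below uses a small made-up series of measurements; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical measurements; the last value looks suspiciously large.
values = pd.Series([10, 12, 12, 13, 14, 15, 40])

# First and third quartiles, and the interquartile range between them.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Limits set 1.5 times the IQR below Q1 and above Q3.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Keep only the points that fall outside the limits.
outliers = values[(values < lower_limit) | (values > upper_limit)]
```

Here the only value flagged is 40. Whether a flagged value should be removed, kept, or reassigned is a separate judgment call.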

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: CHANGING CATEGORICAL DATA TO NUMERICAL DATA

1. Fill in the blank: Label encoding assigns each category a unique _____ instead of a qualitative value.

  • qualifier
  • character
  • string
  • number (CORRECT)

Correct: Label encoding assigns each category a unique number instead of a qualitative value. This process enables data professionals to more effectively work with categorical data.
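One way to label encode a column in pandas is to convert it to the categorical type and read off its integer codes. This is a sketch under assumptions: the pet column is hypothetical, and other approaches (such as an explicit mapping dictionary) work as well.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"pet": ["dog", "cat", "hamster", "cat", "dog"]})

# Convert to the categorical dtype, then use the integer codes.
# Codes start at 0 and follow the sorted order of the categories.
df["pet_code"] = df["pet"].astype("category").cat.codes
```

Note that the codes start at 0 and are assigned alphabetically here, so "cat" becomes 0, "dog" becomes 1, and "hamster" becomes 2.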

2. When working with dummy variables, data professionals may assign the variables an infinite number of values.

  • True
  • False (CORRECT)

Correct: Dummy variables have a value of 0 or 1. They are used to indicate the absence (0) or presence (1) of something.

3. Which pandas function does a data professional use to convert categorical variables into dummy variables?

  • convert_categories()
  • get_categories()
  • get_dummies() (CORRECT)
  • convert_dummies()

Correct: The get_dummies() function is used to convert categorical variables into dummy variables.
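A minimal sketch of get_dummies() on a made-up fuel column follows. Passing dtype=int is a choice made for this example so that the output shows 0 and 1 rather than Boolean values.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"fuel": ["gas", "electric", "gas"]})

# One column per category; 1 marks presence and 0 marks absence.
dummies = pd.get_dummies(df["fuel"], dtype=int)
```

The result has one column named electric and one named gas, and each row contains a single 1.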

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: INPUT VALIDATION

1. Data professionals use input validation to ensure data is complete, error-free, and of high-quality.

  • True (CORRECT)
  • False

Correct: Data professionals use input validation to ensure data is complete, error-free, and of high-quality.

2. Fill in the blank: If a dataset lacks sufficient information to answer a business question, the process of _____ makes it possible to augment that data by adding values from other datasets.

  • sampling
  • joining (CORRECT)
  • summing
  • blending

Correct: If a dataset lacks sufficient information to answer a business question, the process of joining makes it possible to augment that data by adding values from other datasets. Joining is most useful if the new data is validated to ensure its format and data entries align and are the same data type as the original dataset.
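The point about matching data types can be illustrated with a short sketch. Both DataFrames and their columns are hypothetical; the idea is simply to compare the key column's type in each dataset and align it before joining.

```python
import pandas as pd

# Hypothetical original data: zip codes stored as text.
df = pd.DataFrame({"zip_code": ["73301", "80014"], "sales": [5, 8]})

# Hypothetical new data: the same zip codes stored as integers.
df_new = pd.DataFrame({"zip_code": [73301, 80014], "state": ["TX", "CO"]})

# Validate: does the key column have the same type in both datasets?
same_dtype = df["zip_code"].dtype == df_new["zip_code"].dtype

# Align the key's type, then join.
df_new["zip_code"] = df_new["zip_code"].astype(str)
df_joined = df.merge(df_new, how="left", on="zip_code")
```

In this example same_dtype is False, which is the signal to convert the key before merging.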

3. In which phase of the PACE workflow would a data professional perform the majority of the data-validation process?

  • Execute
  • Construct
  • Plan
  • Analyze (CORRECT)

Correct: A data professional performs the majority of the data-validation process in the Analyze phase of the PACE workflow. However, it’s important to prioritize data validation throughout all four phases.

QUIZ: MODULE 3 CHALLENGE

1. Which of the following terms are used to describe missing data? Select all that apply.

  • Zero
  • Blank (CORRECT)
  • NaN (CORRECT)
  • N/A (CORRECT)

Correct!

2. Stakeholders at a film studio hire a data analytics firm to provide insights about the best locations for film shoots. However, the film studio’s datasets contain missing data. Which of the following strategies can help the data analytics firm solve this problem? Select all that apply.

  • Use their best judgment to add in values themselves.
  • Create a NaN category. (CORRECT)
  • Add in the missing values by taking the average values from the existing data. (CORRECT)
  • Ask the film studio to fill in the missing values. (CORRECT)

Correct!

3. A data professional writes the following code:

df.merge(df_zip, how='left', 
    on=['date','center_point_geom'])

Which section of the code refers to the dataframe to be merged with df?

  • df_zip (CORRECT)
  • how='left'
  • merge
  • center_point_geom

Correct!

4. What pandas function is used to pull all of the missing values from a data frame?

  • pd.getnull()
  • pd.ofnull()
  • pd.findnull()
  • pd.isnull() (CORRECT)

Correct!
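As an illustration of pulling missing values with pd.isnull(), the sketch below builds a Boolean mask and uses it to select the affected rows. The DataFrame and its columns are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry in the strikes column.
df = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-02", "2018-08-03"],
    "strikes": [10, np.nan, 7],
})

# True where the value is missing, False otherwise.
mask = pd.isnull(df["strikes"])

# Select only the rows where strikes is missing.
missing_rows = df[mask]
```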

5. What type of outliers are values that are completely different from the overall data group and have no association with any other outliers?

  • Collective outliers
  • Global outliers (CORRECT)
  • Contextual outliers
  • Dissimilar outliers

Correct!

6. A data professional works for a car insurance company. To gain insights about the popularity of electric vehicles, they study categorical data about cars. They add a 0 to their dataset to indicate if a car is gas-powered and a 1 if a car is electric. What does this scenario describe?

  • Applying a variable character
  • Changing a floating point
  • Using dummy variables (CORRECT)
  • Removing a data operator

Correct!

7. What type of data visualization shows the concentration of values between two data points by illustrating their magnitude with two colors?

  • Heat map (CORRECT)
  • Treemap
  • Scatter plot
  • Density map

Correct!

8. What does the pandas function pd.duplicated() return to indicate that a data value does not have a duplicate value within the same dataset?

  • True
  • Duplicate
  • Unique
  • False (CORRECT)

Correct!

9. Fill in the blank: The pandas function _____ enables data professionals to create a new dataframe with all duplicate rows removed.

  • drop_duplicates() (CORRECT)
  • deduplicate()
  • de_duplication()
  • deduplication()

Correct!
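A short sketch of the two duplicate-handling functions from the questions above. In pandas, duplicated() and drop_duplicates() are called as methods on a DataFrame or Series; the brand names and prices here are invented.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first.
df = pd.DataFrame({
    "brand": ["Aura", "Bolt", "Aura", "Cade"],
    "price": [100, 150, 100, 90],
})

# True marks a row that repeats an earlier row; the first occurrence is False.
is_duplicate = df.duplicated()

# Returns a new DataFrame without the repeated rows; df itself is unchanged.
df_unique = df.drop_duplicates()
```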

10. Which of the following terms can be used to describe a value that is not stored for a variable in a set of data? Select all that apply.

  • Zero
  • N/A (CORRECT)
  • NaN (CORRECT)
  • Blank (CORRECT)

Correct!

11. A data professional writes the following code:

df.merge(df_zip, how='left', 
    on=['date','center_point_geom'])

Which of the following is a parameter for the merge?

  • df_joined
  • how='left' (CORRECT)
  • df.merge()
  • df.head()

Correct!

12. What tasks could the pandas function pd.isnull() be used for? Select all that apply.

  • To delete all of the values from a data frame
  • To change all values to nulls in a data frame
  • To identify when a value is missing from a data frame (CORRECT)
  • To pull all of the missing values from a data frame (CORRECT)

Correct!

13. Fill in the blank: Contextual outliers are normal data points under certain conditions but become _____ under most other conditions.

  • Insignificant
  • Samples
  • Anomalies (CORRECT)
  • Standard

Correct!

14. A data professional works for a veterinary office. To gain insights about the most common household pets, they study categorical data about pet adoptions over the past five years. They assign the number 1 to dogs, 2 to cats, 3 to hamsters, and so on. What does this scenario describe?

  • Data blending
  • Label encoding (CORRECT)
  • Data partitioning
  • Aliasing

Correct!

15. Fill in the blank: A _____ is a data visualization that displays the magnitude of a set of values using two colors to show the concentration of the values.

  • heat map (CORRECT)
  • bubble chart
  • bar graph
  • line chart

Correct!

16. Fill in the blank: A data professional should _____ a duplicate when its value is clearly a mistake or will misrepresent the remaining unique values within the dataset.

  • eliminate (CORRECT)
  • keep
  • filter
  • replicate

Correct!

17. Fill in the blank: N/A and NaN are terms used to describe _____ data.

  • missing (CORRECT)
  • nominal
  • qualitative
  • string

Correct!

18. What does the pandas function pd.duplicated() return to indicate that a data value is a duplicate of another value within the same dataset?

  • Duplicate
  • Unique
  • False
  • True (CORRECT)

Correct!

19. A data professional at a garden center researches data related to ideal growing climates. As they familiarize themselves with the datasets, they discover some data is missing. Which of the following strategies can help them solve this problem? Select all that apply.

  • Change the missing values to Boolean data that is either true or false.
  • Create a NaN category. (CORRECT)
  • Derive new representative values based on available data. (CORRECT)
  • Add in the missing values by taking the average values from the existing data. (CORRECT)

Correct!

20. What pandas function enables a data professional to determine if duplicate values are present in a dataset?

  • pd.deduplication()
  • pd.duplicated() (CORRECT)
  • pd.dupe()
  • pd.deduplicates()

Correct!

21. A data team for an investment banker works on a project related to interest rates. As they familiarize themselves with the datasets, they discover some data is missing. Which of the following strategies can help them solve this problem? Select all that apply.

  • Change the missing values to zeros.
  • Ask the owner of the data to fill in the missing values. (CORRECT)
  • Derive new representative values based on available data. (CORRECT)
  • Add in the missing values by taking the average values from the existing data. (CORRECT)

Correct!

22. A data team works for a stereo installation company. To gain insights into what products people are most likely to purchase in the coming year, they review categorical data about 20 of the most popular stereos. Rather than using brand names, they assign a different number to each stereo to make the data simpler to join. What does this scenario describe?

  • Data smoothing
  • Label encoding (CORRECT)
  • Aggregation
  • Normalization

Correct: This scenario describes label encoding, which assigns each category a unique number instead of a qualitative value.

23. A data professional writes the following code:

df.merge(df_zip, how='left', 
    on=['date','center_point_geom'])

Which of the following indicates that the first data frame should be merged with another data frame?

  • on=
  • how=
  • merge() (CORRECT)
  • zip

Correct!

24. What pandas function is used to identify when a value is missing from a data frame?

  • null.pd()
  • pd.null()
  • null().pd
  • pd.isnull() (CORRECT)

Correct!

25. Data encoded as N/A, NaN, or a blank is defined as zero.

  • True
  • False (CORRECT)

Correct: Data encoded as N/A, NaN, or a blank is defined as a value that is not stored for a variable in a dataset. This is different from a data point of zero, which may be a missing value or a legitimate data point.
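The difference between a missing value and a zero shows up directly in pandas. The rainfall series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: a true zero, a missing value, and a positive value.
rainfall = pd.Series([0.0, np.nan, 2.5])

# Only the NaN entry is flagged as missing; 0.0 is treated as a stored value.
missing_mask = pd.isnull(rainfall)

# Missing values are skipped by default, so the mean uses 0.0 and 2.5 only.
mean_rainfall = rainfall.mean()
```

Replacing the NaN with 0 would change the mean, which is one reason the two should not be treated as interchangeable.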

26. What is indicated by the term null?

  • The data is missing. (CORRECT)
  • The data point is mandatory.
  • The data has a value of zero.
  • The data has a value that is not stored for a variable in the dataset.

Correct: The term null indicates that the data is missing.

27. Fill in the blank: Outliers are observations that are an _____ distance from other values.

  • equal
  • optimal
  • adequate
  • abnormal (CORRECT)

Correct: Outliers are observations that are an abnormal distance from other values. They may also be observations that are abnormal compared to the overall pattern of the data population.

28. Docstrings are useful within a line of Python code, but they cannot be exported to create library documentation.

  • True
  • False (CORRECT)

Correct: Docstrings are lines of text following a method or function that explain to others what that method or function does. They can also be easily exported to create library documentation.

29. Categorical data can be grouped on its qualities, thus enabling data professionals to store and identify it based on its category.

  • True (CORRECT)
  • False

Correct: Categorical data can be grouped depending on its qualities, thus enabling data professionals to store and identify it based on its category.

30. Fill in the blank: A heat map uses _____ to depict the magnitude of an instance or set of values.

  • colors (CORRECT)
  • markers
  • lines
  • plots

Correct: A heat map uses colors to depict the magnitude of an instance or set of values. Heat maps are a type of data visualization that is useful for showing the concentration of values between different data points.