META MARKETING ANALYTICS PROFESSIONAL CERTIFICATE

Course 2: Introduction to Data Analytics

Week 3: Data Cleaning and Processing

Coursera Study Guide

Click to Enroll in Coursera Meta Marketing Analytics Professional Certificate

CONTENT

In week three, you’ll dig into how to clean and process data you’ve gathered using spreadsheets, SQL, and the Python Data Analytics Stack (Pandas).

Learning Objectives

  • Sort, query and structure data with spreadsheets
  • Evaluate the limitations of spreadsheets as a means of cleaning data
  • Sort, query and structure data with SQL
  • Use Pandas DataFrames to clean and explore data
  • Sort, query and structure data using Pandas, a Python library

PRACTICE QUIZ: FILTERING DATA WITH GOOGLE SHEETS

1. If you have a list of numbers, and you’d like to arrange them from lowest to highest, you would:

  • All of the above
  • Visualize them
  • Filter them
  • Sort them (CORRECT)

Correct: That’s right! Sorting in a spreadsheet tool allows you to change how and in what order your data is presented.

2. True or false: Filtering data in a spreadsheet deletes the information that is no longer shown.

  • True
  • False (CORRECT)

Correct: That’s right! You can simply turn a filter off to show all of the data again.

3. Which function would you use to find the mean of a set of data in a spreadsheet?

  • =MEAN()
  • =AVERAGE() (CORRECT)
  • =MEDIAN()

Correct: That’s right! Mean and average are often used interchangeably.

4. True or false: Once a spreadsheet has determined an appropriate visualization for your data, you can’t make changes to it.

  • True
  • False (CORRECT)

Correct: That’s right! You can make as many alterations and customizations as you need to a visualization.

PRACTICE QUIZ: SQL SELECT STATEMENTS 

1. Use the table of advertising expenditures below to answer ALL the questions in this quiz.

CompanyQuarterInternetTVPrint
FocusVu13,00050003000
Paola Snaps115005000500
Calla & Ivy125009000400
Inu + Neko120005000700
FocusVu2300070001000
Paola Snaps2150011000600
Calla & Ivy223007000400
Inu + Neko218005000200
FocusVu330005500550
Paola Snaps3500015000400
Calla & Ivy322009000500
Inu + Neko330006000700
FocusVu4350010000900
Paola Snaps4630050001100
Calla & Ivy420008500250
Inu + Neko434008000300
AdvertExpend Table

Which query will return the company, internet expenditure, and quarter in descending order of internet expenditure?

 SELECT Company, Internet
 FROM AdvertExpend
 ORDER BY Internet DESC
 SELECT Company, Internet, Quarter
 FROM AdvertExpend
 ORDER BY Quarter DESC
 SELECT Company, Internet, Quarter
 FROM AdvertExpend
 SELECT Company, Internet, Quarter (CORRECT)
 FROM AdvertExpend
 ORDER BY Internet DESC

(CORRECT)

Correct: Good Job!. Remember that the columns you enter after the SELECT clause will be returned in the return-set. So Company, Internet, and Quarter data values will be in the return-set. The ORDER BY arranges these rows relative to the values in the Internet column with the largest value in the first row. (Fourth Option)

2. Which of the following queries will return the unique companies found in the table and arrange them in alphabetical order?

 SELECT Company
 FROM AdvertExpend
 ORDER BY Company
  SELECT DISTINCT(Company) 
  FROM AdvertExpend
  ORDER BY Company DESC
  SELECT DISTINCT(Company)
  FROM AdvertExpend
  SELECT DISTINCT(Company) (CORRECT)
  FROM AdvertExpend
  ORDER BY Company

(CORRECT)

Correct: Good Job! The DISTINCT addition after the SELECT clause will ignore duplicate values in the return-set which is what is meant by the word distinct. We also order by the Company so that the returned values are in alphabetical order.

3. What will the following query have in its return-set?

 SELECT Company, Quarter, TV
 FROM AdvertExpend
 ORDER BY TV 
 LIMIT 4
  • The return-set will contain only the four companies that have spent the least amount of TV advertising.
  • The return-set will have company, quarter, and TV expenditure values for the four rows that have the least TV advertising values. (CORRECT)
  • The return-set will contain all the values found in the Company, Quarter, and TV columns.
  • The return-set will contain the company, quarter, and TV values for the four rows that have the highest values in the TV column.

Correct: Good Job! Since Company, Quarter, and TV were all selected, these columns will show up in the return set arranged from smallest TV value to largest. Yet, only the first four rows will end up being in the return-set.

PRACTICE QUIZ: SQL FILTERING + GROUPING

1. For the ALL the following questions use the table of advertising expenditures given below.

CompanyQuarterInternetTVPrint
FocusVu13,00050003000
Paola Snaps115005000500
Calla & Ivy125009000400
Inu + Neko120005000700
FocusVu2300070001000
Paola Snaps2150011000600
Calla & Ivy223007000400
Inu + Neko218005000200
FocusVu330005500550
Paola Snaps3500015000400
Calla & Ivy322009000500
Inu + Neko330006000700
FocusVu4350010000900
Paola Snaps4630050001100
Calla & Ivy420008500250
Inu + Neko434008000300
AdvertExpend Table

Which query will return the quarters for which Calla & Ivy had a print advertising expenditure less than $400?

 SELECT Quarter 
 FROM AdvertExpend
 WHERE Print = 400 AND Company = ‘Calla & Ivy’
 SELECT Quarter
 FROM AdvertExpend
 WHERE Print < 400 
 SELECT Print
 FROM AdvertExpend
 WHERE Quarter < 400 AND Company = ‘Calla & Ivy’
 SELECT Quarter (CORRECT)
 FROM AdvertExpend
 WHERE Print < 400 AND Company = ‘Calla & Ivy’

(CORRECT)

Correct: Well done! We select the column Quarter because those are the data values that we want returned. But we use the WHERE to restrict the return-set to only include values for Print that are less than 400 for the company Calla & Ivy.

2. Which query will get us the total TV expenditure in each quarter across all companies?

 SELECT Quarter, MAX(TV)
 FROM AdvertExpend
 GROUP BY Quarter
 SELECT Quarter, TV
 FROM AdvertExpend
 GROUP BY Quarter
 SELECT Quarter, SUM(TV)
 FROM AdvertExpend
 GROUP BY TV
 SELECT Quarter, SUM(TV) (CORRECT)
 FROM AdvertExpend
 GROUP BY Quarter

(CORRECT)

Correct: Good Job! This query will return the total amount spent on TV advertising in each quarter.

3. What will the following query have returned?

 SELECT AVG(Internet)
 FROM AdvertExpend
 WHERE Company = ‘Calla & Ivy’
 GROUP BY Quarter
  • The query will return Calla & Ivy’s average internet expenditure across all quarters.
  • The query will return the internet expenditure for an average quarter at Calla & Ivy.
  • The query will return Calla & Ivy’s total internet expenditure for each quarter.
  • The query will return Calla & Ivy’s average internet expenditure for each quarter. (CORRECT)

Correct: Good job! This query will return the average internet expenditures for Calla & Ivy each quarter.

PRACTICE QUIZ: FINDING AND REMOVING DUPLICATE RECORDS

1. Given that you have a Pandas DataFrame `df`, how would you export the DataFrame to the CSV file “my_data.csv”?

  • df.to_csv(“my_data.csv”) (CORRECT)
  • df.to_csv = “my_data.csv”
  • df.save(“my_data.csv”)
  • df[“my_data.csv”]

Correct: That’s right! To export the DataFrame to a CSV file, you would use `.to_csv()` with the name of the file as a parameter.

2. You’re given a CSV and load it to a DataFrame assigned to the variable `df`. Running `df.head()` gives you the following

Column 1Column 2Column 3Column 4
05.92heating1418twins
12.48santa cruz1171enrollment
25.23ought1616pork
33.13emily9442corey
42.20rising4620berry

What code would get you just the data for `Column 2`?

  • df = ‘Column 2
  • df[2]
  • df[‘Column 2’] (CORRECT)
  • df(‘Column 2’)

Correct: That’s right! We can get just one column of a DataFrame using the square-bracket notation, similar to how we used keys in Python dictionaries.

3. What code would get you how many rows and columns are in a DataFrame `df`?

  • df.shape()
  • df.size
  • df.size()
  • df.shape (CORRECT)

Correct: That’s right! The code `df.shape` will get us the rows and columns of the DataFrame since `shape` is an attribute of the DataFrame.

GRADED QUIZ: PANDAS AND SQL ASSESSMENT

1. You are given the following data loaded as a DataFrame `df` using Pandas.

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 1)

What code would you write to get all the rows where `subscriptionTier` has the word “Free”?

  • df[ df[subscriptionTier].str.contains(“Free”) ] (CORRECT)
  • df[subscriptionTier].contains(“Free”)
  • df[ df[subscriptionTier].contains(“Free”) ]
  • df[subscriptionTier].str.contains(“Free”)

Correct: Correct! We use square brackets and the .str.contains() method to filter the data.

3. Say I have the given data loaded as a DataFrame `df` using Pandas. We know that the `customerID` column has a unique value (representing a different customer).

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 2)

What code would I write to find out how many customers there are in each industry?

  • df.groupby(‘industry’)[‘customerID’].count() (CORRECT)
  • df[‘customerID’].groupby(‘industry’).count()
  • df.groupby(‘customerID’).count()

Correct: That’s correct! You first group by `industry` and then using just the `customerID` column, we get the count.

4. You can find the list of columns with `df.columns()`

  • T​rue
  • False (CORRECT)

Correct: Correct! DataFrames have an attribute `columns` that can be accessed using “dot notation” to get a list of the columns’ names in the DataFrame. You don’t use parentheses.

5. You are given the following data loaded as a DataFrame `df` using Pandas.

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 3)

What code would you write to get all the rows where `industry` starts with the letters “Re”?

  • df[ df[‘industry’].contains(“Re”) ]
  • df[ df[‘industry’].str.contains(“Re”) ]
  • df[ df[‘industry’].str.startswith(“Re”) ] (CORRECT)
  • df[ df[‘industry’].startswith(“Re”) ]

Correct: Correct! We use square brackets and the .str.startswith() method to filter the data.

6. Say I have the given data loaded as a DataFrame `df` using Pandas. We know that the `customerID` column has a unique value (representing a different customer).

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 4)

What code would I write to find out how many customers there are that have the security add on?

  • df.groupby(‘customerID’).count()
  • df[‘customerID’].groupby(‘addOn_security’).count()
  • df.groupby(‘customerID’)[‘addOn_security’].count()
  • df.groupby(‘addOn_security’)[‘customerID’].count() (CORRECT)

Correct: That’s correct! You first group by `addOn_security` and then using just the `customerID` column, we get the count.

7. Which code would you write to filter a DataFrame `df` so that you keep only the first instance of a duplicate?

  • df[ ~df.duplicated() ] (CORRECT)
  • df[ df.duplicated() ]
  • df[ df.duplicated(keep=False) ]
  • df[ ~df.duplicated(keep=False) ]

Correct: Correct!

8. You are given the following data loaded as a DataFrame `df` using Pandas. We know that the `customerID` column has a unique value (representing a different customer).

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 5)

How would you interpret this output?

 df[ 'totalCosts' ].quantile(0.6)
 4.99
  • 60% of the data has a totalCost of 4.99 or more
  • 60% of the data has a value of 4.99 or less for the totalCost column (CORRECT)
  • 60% of the data has a totalCost of exactly 4.99
  • 40% of the data has a totalCost of 4.99 or less

Correct: Correct! The quantile method returns the cutoff value for the percentage given.

9. You are given the following data loaded as a DataFrame `df` using Pandas.

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 6)

What code would you write to get just the single column `totalCost`?

  • df[‘totalCost’] (CORRECT)
  • df.column[‘totalCost’]
  • df.column(‘totalCost’)
  • df(‘totalCost’)

Correct: Correct! We use square brackets and a string representing the column name to get just one column of the DataFrame.

10. Say I have the given data loaded as a DataFrame `df` using Pandas. We know that the `customerID` column has a unique value (representing a different customer).

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 7)

What code would I write to find out how many customers there are in each subscription tier?

  • df.groupby(‘customerID’).count()
  • df[‘customerID’].groupby(‘subscriptionTier’).count()
  • df.groupby(‘subscriptionTier’)[‘customerID’].count() (CORRECT)
  • df.groupby(‘customerID’)[‘subscriptionTier’].count()

Correct: That’s correct! You first group by `‘subscriptionTier’` and then using just the `customerID` column, we get the count.

11. You are given the following data loaded as a DataFrame `df` using Pandas. We know that the `customerID` column has a unique value (representing a different customer).

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 8)
 df[ 'totalCosts' ].quantile(0.8)
 8.99
  • 80% of the data has a value of 8.99 or less for the totalCost column (CORRECT)
  • 20% of the data has a totalCost of exactly 8.99
  • 80% of the data has a totalCost of 8.99 or more
  • 20% of the data has a totalCost of 8.99 or less

Correct: Correct! The quantile method returns the cutoff value for the percentage given.

12. You are given the following data loaded as a DataFrame `df` using Pandas.

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 9)

What code would you write to get all the rows where `industry` starts with the letter “R”?

  • df[ df[‘industry’].str.startswith(“R”) ] (CORRECT)
  • df[‘industry’].str.starts(“R”)
  • df[ df[‘industry’].startswith(“R”) ]
  • df[ df[‘industry’].starts(“R”) ]

Correct: Correct! We use square brackets and the .str.startswith() method to filter the data.

13. Which code would you write to filter a DataFrame `df` so that you only have values that had no duplicate rows?

  • df[ ~df.duplicated(keep=False) ]
  • df[ df.duplicated(keep=’False’) ]
  • df[ ~df.duplicated(keep=’last’) ] (CORRECT)
  • df[ df.duplicated(keep=’last’) ]

Correct:Correct!

14. You are given the following data loaded as a DataFrame `df` using Pandas. We know that the `customerID` column has a unique value (representing a different customer).

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 10)

What code would I write to find out how many customers there are in each subscription tier?

  • df.value_counts()[‘subscriptionTier’]
  • df[‘subscriptionTier’].values
  • df[‘subscriptionTier’].value_counts() (CORRECT)
  • len(df[‘subscriptionTier’])

Correct: That’s correct! You first get the `subscriptionTier` column and then using the `value_counts()` method we can get the number of customers there are in each group.

15. You are given the following data loaded as a DataFrame `df` using Pandas.

C2W3.1
Meta Marketing Analytics – Course 2 Week 3 (Image 11)

What code would you write to get all the rows where `totalCost` is less than $6?

  • df[‘totalCost’] < 6.00
  • df[‘totalCost’].less_than(6.00)
  • df[ df[‘totalCost’].less_than(6.00) ]
  • df[ df[‘totalCost’] < 6.00 ] (CORRECT)

Correct: Correct! We use square brackets to enclose a mask which uses conditional statements to filter the data.

16. Which ways can you filter data using Google Sheets? Select all that apply.

  • By Font Size
  • By Color (CORRECT)
  • By Condition (CORRECT)
  • By Values (CORRECT)

Correct: Correct! This feature is useful if you’ve already color coded your dataset, but otherwise requires an additional step.

Correct: Correct! You saw filtering by condition using “text contains” conditions, “text does not contain” conditions, and “is greater than or equal to” conditions.

Correct: Correct! You can select specific column values which you wish to exclude by clicking them from the filter menu. You will see the check next to these values removed in the menu. Then click OK to apply the filter.

17. What will the following code give in the return set?

 SELECT DISTINCT company

 FROM advert

 ORDER BY company DESC

 LIMIT 5
  • It returns the list of all companies in reverse alphabetical order.
  • It returns the first five entries from the list of companies in alphabetical order.
  • It returns the first five entries from the list of companies in reverse alphabetical order. (CORRECT)
  • It returns the list of all companies in alphabetical order.

Correct: Excellent! This SQL query creates a list of the unique companies, orders them in reverse-alphabetical order, then returns the top five from that list.

18. For the data below

C2W3.2
Meta Marketing Analytics – Course 2 Week 3 (Image 12)

What are the resulting data for the SQL query:

 SELECT company, quarter, internet
 FROM adverts
 WHERE internet >= 500
  • The return-set will consist of the Company, Quarter, and Internet column values for only those rows where the internet value is 500 or greater. (CORRECT)
  • The return-set will consist of the Internet column values for only those rows where the internet value equals 500.
  • The return-set will consist of the Company, Quarter, and Internet column values for only those rows where the internet value is less than 500.
  • The return-set will consist of the Company and Internet column values for only those rows where the internet value is 500 or greater.

Correct: Excellent! This SQL command returns the data values in the columns for every row that has a value of 500 or more in the internet column.

19. The tenure column records how many months an individual has been a customer. Which code blocks would correctly filter the data to individuals who had been a customer for a year or more? Select all that apply.

  • 1 mask = df[ ‘Tenure’ ] == 12
  • 2 df[mask]
  • 1 df[df[ ‘Tenure’] >= 12] (CORRECT)
  • 1 mask = df[ ‘Tenure’ ] > 11 (CORRECT)
  • 2 df[mask]
  • 1 df[df[ ‘Tenure’] >= 12] (CORRECT)

Correct: Correct! This code filters the dataset in one line. Inside the square brackets is the mask, or conditional comparison.

Correct: Correct! This code block does subset the data to customers who have subscribed for a year or more. Using >= 12 like options A and D might be clearer to read though.

Correct: Correct! In the first line, you create a conditional mask based on the Tenure column. Then you apply this mask to filter the DataFrame in the second line.

20. Which code would correctly remove all duplicate values? Select all that apply.

  • 1 deduplicated = df.filter(df.duplicated())
  • 1 deduplicated = df[df.duplicated()]
  • 1 mask = ~df.duplicated() (CORRECT)
  • 2 deduplicated = df[mask]
  • 1 deduplicated = df[~df.duplicated()] (CORRECT)

Correct: Correct! This code will remove all duplicate records. The first record from every duplicate will be kept.

Correct: Correct! This code will remove all duplicate records. The first record from every duplicate will be kept, and it is written all as one line.

21. Which code correctly groups customers by subscription tier and returns the total monthly sales from each tier group?

  • df.groupby([‘subscriptionTier’, ‘totalCost’]).sum()
  • df.groupby(‘subscriptionTier’)[‘totalCost’].mean()
  • df.groupby(‘totalCost’)[‘subscriptionTier’].sum()
  • df.groupby(‘subscriptionTier’)[‘totalCost’].sum() (CORRECT)

Correct: Correct! This code groups the customer dataset by subscriptionTier and calculates the total sales for Focus Vu customers in each of these subscription tiers.

22. How would you interpret this output? Select all that apply.

1 df['totalCosts'].quantile(.75)
2 7.99
  • 25% of the data has a totalCosts of 7.99 or less.
  • 75% of the data has a totalCost of 7.99 or more.
  • 25% of the data has a totalCosts of 7.99 or more. (CORRECT)
  • 75% of the data has a value of 7.99 or less for the totalCosts column. (CORRECT)

Correct: Correct! The quantile method tells you that 75% of the data has a totalCost of 7.99 or less. Therefore the remaining 25% has a totalCost of 7.99 or more.

Correct: Correct! The quantile method returns the cutoff value for the percentage given. .75 is 75% So this query tells us that 75% of the data has a totalCosts of 7.99 or less.

Subscribe to our site

Get new content delivered directly to your inbox.

Liking our content? Then, don’t forget to ad us to your BOOKMARKS so you can find us easily!