COURSE 5: REGRESSION ANALYSIS: SIMPLIFY COMPLEX DATA RELATIONSHIPS

Module 5: Logistic Regression

GOOGLE ADVANCED DATA ANALYTICS PROFESSIONAL CERTIFICATE

Complete Coursera Study Guide

INTRODUCTION – Logistic Regression

In this section, participants study binomial logistic regression, a statistical technique for classifying data into two categories. The section begins with the underlying principles and methods of binomial logistic regression, then moves to hands-on practice in constructing and interpreting these regression models.

As participants progress, they see how data professionals apply binomial logistic regression in practice. The emphasis is on how this regression analysis supports data classification and informed decision-making in a range of contexts.

Learning Objectives

  • Identify the differences between binomial logistic regression and log-linear regression
  • Run a log-linear Poisson regression
  • Run a multinomial logistic regression in Python
  • Interpret the results of a binomial logistic regression model
  • Evaluate a binomial logistic regression model
  • Define confusion matrix, ROC, AUC, precision, recall, and type 1 and type 2 errors as they relate to binomial logistic regression
  • Run a binomial logistic regression in Python
  • Articulate the main assumptions of binomial logistic regression
  • Explain the relevance of data characteristics to choosing between regression models
  • Explain the use of the sigmoid function in a binomial logistic regression
  • Define a binomial logistic regression

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: FOUNDATIONS OF LOGISTIC REGRESSION

1. No extreme outliers is one of the four main binomial logistic regression assumptions. What are the other three? Select all that apply.

  • Homoscedasticity
  • Linearity (CORRECT)
  • Independent observations (CORRECT)
  • No multicollinearity (CORRECT)

Correct: The four main binomial logistic regression assumptions are: linearity, independent observations, no multicollinearity, and no extreme outliers.
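
One common way to screen for the no multicollinearity assumption is to check the pairwise correlation between X variables. The following is a minimal sketch, not a course-prescribed method: the predictor names, the data, and the idea of flagging correlations near 1 are all illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predictors (made-up values for illustration only)
hours_studied = [1, 2, 3, 4, 5, 6]
pages_read = [12, 19, 33, 41, 48, 62]   # rises almost in step with hours_studied
hours_slept = [7, 6, 8, 5, 7, 6]        # no clear relationship to hours_studied

r_high = pearson_r(hours_studied, pages_read)
r_low = pearson_r(hours_studied, hours_slept)

print(f"hours_studied vs pages_read:  r = {r_high:.2f}")
print(f"hours_studied vs hours_slept: r = {r_low:.2f}")
```

A pair with a correlation close to 1 or -1, such as the first pair here, is a candidate for dropping or combining before fitting the model.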

2. Logit is the logarithm of the odds of a given probability.

  • True (CORRECT)
  • False

Correct: Logit is the logarithm of the odds of a given probability. It is the most common link function used to linearly relate the X variables to the probability of Y.
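
The relationship between the logit and the sigmoid function can be sketched in a few lines of Python. This is an illustration under stated assumptions: the coefficients `beta_0` and `beta_1` are hypothetical values, not estimates from any course dataset.

```python
import math

def logit(p):
    """Log-odds of a probability p, where 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real number to a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients for a model with one X variable
beta_0, beta_1 = -3.0, 1.5

for x in [0, 1, 2, 3, 4]:
    log_odds = beta_0 + beta_1 * x   # linear in x
    p = sigmoid(log_odds)            # probability that Y equals 1
    print(x, round(log_odds, 2), round(p, 3))
```

The log-odds column changes by a constant amount for each unit of x, while the probability column follows an S-shaped curve bounded by 0 and 1.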

3. Fill in the blank: The maximum likelihood estimation is a technique used for estimating the beta parameters that _____ the likelihood of a model producing the observed data.

  • control
  • balance
  • maximize (CORRECT)
  • reduce

Correct: The maximum likelihood estimation is a technique used for estimating the beta parameters that maximize the likelihood of a model producing the observed data.
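
To make the idea concrete, the sketch below scores candidate beta parameters by their log-likelihood and keeps the best pair. It assumes a one-variable model and made-up data, and it uses a coarse grid search purely for readability; statistical libraries typically use iterative optimizers instead.

```python
import math

def log_likelihood(beta_0, beta_1, xs, ys):
    """Log-likelihood of observed 0/1 outcomes under a logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(beta_0 + beta_1 * x)))  # P(Y = 1 | x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Hypothetical data: Y = 1 becomes more common as x grows
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

# Candidate (beta_0, beta_1) pairs on a grid with step 0.5
candidates = [(b0 / 2, b1 / 2) for b0 in range(-12, 13) for b1 in range(-6, 7)]
best = max(candidates, key=lambda b: log_likelihood(b[0], b[1], xs, ys))

print("best betas on the grid:", best)
print("log-likelihood at best:", round(log_likelihood(*best, xs, ys), 3))
print("log-likelihood at (0, 0):", round(log_likelihood(0.0, 0.0, xs, ys), 3))
```

The pair with the highest log-likelihood is the one under which the observed outcomes are most probable, which is what maximum likelihood estimation selects.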

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: LOGISTIC REGRESSION WITH PYTHON

1. When building a logistic regression model, what does CLF stand for?

  • Claimer
  • Codifier
  • Connector
  • Classifier (CORRECT)

2. Which package do you use to create a plot of your model to visualize its results?

  • Dashboard package
  • Matrix package
  • Results package
  • Seaborn package (CORRECT)

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: INTERPRET LOGISTIC REGRESSION RESULTS

1. A confusion matrix is a graphical representation of how accurate a classifier is at predicting which of the following for a categorical variable?

  • Validity
  • Errors
  • Labels (CORRECT)
  • Precision

Correct: The confusion matrix is a graphical representation of how accurate a classifier is at predicting the labels for a categorical variable. It displays how many data points the classifier categorized correctly for each category. The other squares in the grid convey how many data points were misclassified.
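
The four cells of a binary confusion matrix can be tallied directly from actual and predicted labels. The sketch below is a plain-Python illustration with hypothetical labels; the function name `confusion_counts` is an assumption for this example, not a library call.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, and FN for binary labels, where 1 is the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # correctly predicted positive
        elif a == 0 and p == 0:
            counts["TN"] += 1   # correctly predicted negative
        elif a == 0 and p == 1:
            counts["FP"] += 1   # false positive (type 1 error)
        else:
            counts["FN"] += 1   # false negative (type 2 error)
    return counts

# Hypothetical labels for eight observations
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```

The TP and TN cells hold the correctly categorized points; the FP and FN cells hold the misclassified ones.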

2. Fill in the blank: _____ measures the proportion of positive predictions that were true positives.

  • Accuracy
  • Validity
  • Precision (CORRECT)
  • Recall

Correct: Precision measures the proportion of positive predictions that were true positives.

3. Which of the following provide additional information about the likelihood of a result occurring merely by chance? Select all that apply.

  • Maximum likelihood estimation
  • Logit
  • Confidence intervals (CORRECT)
  • P-value (CORRECT)

Correct: The p-value and confidence intervals provide additional information about the likelihood of a result occurring merely by chance.
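
As a rough illustration of how these two quantities relate to a single model coefficient, the sketch below computes a two-sided p-value and an approximate 95% confidence interval. It assumes a normal (Wald) approximation, which the guide does not specify, and the coefficient and standard error are hypothetical values.

```python
import math

def wald_summary(coef, std_err, z_crit=1.96):
    """Two-sided p-value and approximate 95% confidence interval for one coefficient."""
    z = coef / std_err
    # Standard normal CDF evaluated at |z|, via the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    ci = (coef - z_crit * std_err, coef + z_crit * std_err)
    return z, p_value, ci

# Hypothetical coefficient estimate and standard error
z, p_value, ci = wald_summary(coef=1.2, std_err=0.4)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print(f"approximate 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

A small p-value and an interval that excludes 0 both suggest that the estimated relationship is unlikely to be due to chance alone.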

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: COMPARE REGRESSION MODELS

1. Which model might a data professional consider first if the outcome variable is binary?

  • Single linear regression
  • Multiple linear regression
  • Binomial logistic regression (CORRECT)
  • Hypothesis testing

Correct: If the outcome variable is binary, a data professional might consider a binomial logistic regression model. After building the model, the best way to test if logistic regression is the right choice is to evaluate the model with metrics.

2. A data professional can use recall to evaluate a logistic regression model. What other metrics can be used to meet this goal? Select all that apply.

  • R squared
  • Precision (CORRECT)
  • Confusion matrices (CORRECT)
  • P-value (CORRECT)

Correct: To evaluate a logistic regression model, a data professional can use recall, precision, confusion matrices, and p-values.

QUIZ: MODULE 5 CHALLENGE

1. Fill in the blank: Binomial logistic regression is a technique that models the _____ of an observation falling into one of two categories, based on one or more independent variables.

  • probability (CORRECT)
  • determinant
  • implications
  • causations

2. A data professional calculates a logarithm of the odds of a given probability. What are they calculating?

  • Likelihood
  • Precision
  • Logit (CORRECT)
  • Recall

3. Fill in the blank: Maximum likelihood estimation is a technique for estimating the _____ that maximize the likelihood of the model producing the observed data.

  • beta parameters (CORRECT)
  • continuous coefficients
  • error terms
  • continuous parameters

4. Under the no extreme outliers assumption, when are outliers detected?

  • Either before or after the model is fit
  • After the model is fit (CORRECT)
  • Before the model is fit
  • While the model is being fit

5. What graphical representation demonstrates a classifier’s accuracy at predicting the labels for a categorical variable?

  • Logistic matrix
  • Logistic graph
  • Likelihood matrix
  • Confusion matrix (CORRECT)

6. A data professional calculates precision in logistic regression results. They have 101 true positives, 63 true negatives, 4 false positives, and 2 false negatives. What is the calculation for precision?

  • 101 / (101 + 4) (CORRECT)
  • (101 + 2) / 4
  • (63 + 4) / 101
  • 101 / (63 + 2)

7. A data professional calculates accuracy in logistic regression results. They have 99 true positives, 91 true negatives, and 248 total predictions. What is the calculation for accuracy?

  • 248 / (99 + 91)
  • (248 – 99) / 91
  • 99 / (248 – 91)
  • (99 + 91) / 248 (CORRECT)

8. A data professional calculates recall in logistic regression results. They have 145 true positives, 128 true negatives, 4 false positives, and 2 false negatives. What is the calculation for recall?

  • 145 / (145 + 2) (CORRECT)
  • (128 + 2) / 128
  • (145 + 128) / (4 + 2)
  • (4 – 2) / 145
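
The arithmetic behind questions 6 through 8 can be checked with three short functions. This is a minimal sketch using the counts given in those questions; the function names are chosen for this example.

```python
def precision(tp, fp):
    """Share of positive predictions that were true positives."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn)

def accuracy(tp, tn, total):
    """Share of all predictions that were correct."""
    return (tp + tn) / total

q6_precision = precision(tp=101, fp=4)            # 101 / (101 + 4)
q7_accuracy = accuracy(tp=99, tn=91, total=248)   # (99 + 91) / 248
q8_recall = recall(tp=145, fn=2)                  # 145 / (145 + 2)

print(f"precision = {q6_precision:.3f}")  # 0.962
print(f"accuracy  = {q7_accuracy:.3f}")   # 0.766
print(f"recall    = {q8_recall:.3f}")     # 0.986
```

Note that true negatives do not appear in either the precision or the recall formula, which is why they act as distractors in those questions.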

9. What technique models the probability of an observation falling into one of two categories, based on one or more independent variables?

  • Maximum likelihood estimation
  • Linear regression
  • Log-odds function
  • Binomial logistic regression (CORRECT)

10. What is the logit formula?

  • Logarithm of p divided by 1 minus p (CORRECT)
  • Logarithm of 1 divided by p minus 1
  • Logarithm of p plus 1 divided by p
  • Logarithm of 1 plus p divided by p

11. What technique estimates the beta parameters that maximize the likelihood of the model producing the observed data?

  • Precision
  • Maximum likelihood estimation (CORRECT)
  • Recall
  • Accuracy

12. Which regression assumption states that, if multiple X variables are in a model, they should not be highly correlated with one another?

  • Linearity
  • No multicollinearity (CORRECT)
  • Independent observations
  • No extreme outliers

13. Fill in the blank: A confusion matrix is a graphical representation of how accurate a classifier is at _____ the labels for a categorical variable.

  • spacing
  • predicting (CORRECT)
  • organizing
  • limiting

14. A data professional calculates precision in logistic regression results. They have 89 true positives, 83 true negatives, 3 false positives, and 1 false negative. What is the calculation for precision?

  • 89 / (83 + 1)
  • (89 + 1) / 3
  • (83 + 3) / 89
  • 89 / (89 + 3) (CORRECT)

15. A data professional calculates accuracy in logistic regression results. They have 82 true positives, 75 true negatives, and 202 total predictions. What is the calculation for accuracy?

  • (82 + 75) / 202 (CORRECT)
  • 202 / (82 + 75)
  • 82 / (202 – 75)
  • (202 – 82) / 75

16. A data professional calculates recall in logistic regression results. They have 91 true positives, 84 true negatives, 6 false positives, and 5 false negatives. What is the calculation for recall?

  • (84 + 5) / 84
  • 91 / (91 + 5) (CORRECT)
  • (91 – 6) / (84 – 5)
  • 84 / (84 + 6)

17. The logit formula is built on which other quantity derived from a probability?

  • Precision
  • Odds (CORRECT)
  • Recall
  • Estimation

18. Fill in the blank: A confusion matrix is a graphical representation of how accurate a classifier is at predicting the labels for a _____ variable.

  • categorical (CORRECT)
  • confidence
  • correlated
  • continuous

19. Precision measures the proportion of positive predictions that were false positives.

  • True
  • False (CORRECT)

Correct: Precision measures the proportion of positive predictions that were true positives. Precision is equal to the number of true positives, divided by the sum of true positives and false positives.

20. A data professional calculates accuracy in logistic regression results. They have 87 true positives, 94 true negatives, and 222 total predictions. What is the calculation for accuracy?

  • 222 / (87 + 94)
  • (87 + 94) / 222 (CORRECT)
  • (222 – 87) / 94
  • 87 / (222 – 94)

21. A data professional calculates recall in logistic regression results. They have 99 true positives, 80 true negatives, 7 false positives, and 4 false negatives. What is the calculation for recall?

  • 80 / (80 + 7)
  • (99 – 7) / (80 – 4)
  • (80 + 4) / 80
  • 99 / (99 + 4) (CORRECT)

22. Fill in the blank: Maximum likelihood estimation is a technique for _____ the beta parameters that maximize the likelihood of a model producing the observed data.

  • limiting
  • duplicating
  • eliminating
  • estimating (CORRECT)

23. For the binomial logistic regression linearity assumption, there should be a linear relationship between each X variable and the logit of the probability that which of the following is true?

  • X equals 1
  • Y equals 0
  • X equals Y
  • Y equals 1 (CORRECT)

CONCLUSION – Logistic Regression

In conclusion, participants have worked through binomial logistic regression, a central statistical method for classifying data into two categories. This module has covered the theory behind binomial logistic regression and built practical skills in constructing, interpreting, and evaluating these regression models.

As participants embark on applying this knowledge in real-world scenarios, they are well-equipped to leverage binomial logistic regression as a valuable tool for classifying data and extracting meaningful insights. This foundational understanding adds a significant layer to their proficiency as data professionals, enhancing their ability to contribute effectively to decision-making processes in diverse domains.