COURSE 5: REGRESSION ANALYSIS: SIMPLIFY COMPLEX DATA RELATIONSHIPS

Module 3: Multiple Linear Regression

GOOGLE ADVANCED DATA ANALYTICS PROFESSIONAL CERTIFICATE

Complete Coursera Study Guide

INTRODUCTION – Multiple Linear Regression

In this comprehensive overview, participants will progress beyond simple regression into the realm of more complex models with a focus on multiple linear regression. Building on the foundational concepts of simple linear regression, this section guides learners through the intricacies of incorporating multiple variables into the modeling process. The course systematically explores how each step of multiple regression builds upon the principles established in simple linear regression, providing participants with a robust understanding of the entire modeling process.

Moreover, participants will receive a preview of key machine learning topics such as variable selection, overfitting, and the bias-variance tradeoff. This anticipatory glimpse into advanced concepts not only enriches the learning experience but also prepares participants for the broader landscape of machine learning applications. By combining theoretical understanding with practical application, this section serves as a crucial stepping stone for those aiming to master the complexities of multiple linear regression and gain a deeper insight into the nuances of machine learning principles.

Learning Objectives

  • Define Ridge and Lasso Regression
  • Use variable selection techniques
  • Define statistical power
  • Explain the relationship between variable selection and statistical power
  • Identify the potential of interaction terms in multiple regression
  • Define interaction terms in multiple linear regression
  • Use EDA to evaluate whether multiple regression is appropriate based on the model assumptions
  • Use EDA to identify multicollinearity
  • Define homoscedasticity and heteroscedasticity
  • Extend assumptions for simple linear regression to multiple regression
  • Explain how to handle categorical independent variables with one hot encoding
  • Distinguish between simple linear regression and multiple regression
  • Define multiple regression

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: UNDERSTAND MULTIPLE LINEAR REGRESSION

1. Fill in the blank: _____ is a technique that estimates the linear relationship between one continuous dependent variable and two or more independent variables.

  • Singular linear regression
  • Multiple curved regression
  • Singular curved regression
  • Multiple linear regression (CORRECT)

Correct: Multiple linear regression is a technique that estimates the linear relationship between one continuous dependent variable and two or more independent variables. The multiple regression technique can yield highly interpretable and communicable results.

2. What concept refers to how two independent variables together affect the y dependent variable?

  • One hot encoding
  • Interaction terms (CORRECT)
  • Ordinary least squares
  • Confidence band

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: MODEL ASSUMPTIONS REVISITED

1. Which of the following statements are true? Select all that apply.

  • One hot encoding is for ordinal variables.
  • One hot encoding allows data professionals to turn several categorical variables into one binary variable.
  • One hot encoding is a data transformation technique. (CORRECT)
  • One hot encoding allows data professionals to turn one categorical variable into several binary variables. (CORRECT)

Correct: One hot encoding is a data transformation technique that allows data professionals to turn one categorical variable into several binary variables.
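One hot encoding can be sketched with pandas; the `shirt_size` column and its values below are made-up examples, not data from the course:

```python
import pandas as pd

# Hypothetical data: one categorical variable, "shirt_size"
df = pd.DataFrame({"shirt_size": ["S", "M", "L", "M"]})

# pd.get_dummies turns the single categorical column into
# one binary (0/1) column per category
encoded = pd.get_dummies(df, columns=["shirt_size"], dtype=int)
print(encoded)
```

Each row now has a 1 in exactly one of the new `shirt_size_*` columns, which is what lets a regression model treat the category as numbers.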

2. What is the definition of the no multicollinearity assumption?

  • No predictor variable can be linearly related to the outcome variable.
  • No two independent variables can be highly correlated with each other. (CORRECT)
  • No observation in the dataset can be independent.
  • Variation of the residual must be constant or similar across the model.

Correct: The no multicollinearity assumption states that no two independent variables can be highly correlated with each other. This means that Xi and Xj cannot be linearly related.

3. In what ways might a data professional handle data with multicollinearity? Select all that apply.

  • Square the variables that have high multicollinearity.
  • Turn one categorical variable into several binary variables.
  • Create new variables using existing data. (CORRECT)
  • Drop one or more variables that have high multicollinearity. (CORRECT)

Correct: A data professional might handle data with multicollinearity by dropping one or more variables that have high multicollinearity. They might also create new variables using existing data.
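Before deciding which variables to drop or combine, multicollinearity is often detected with variance inflation factors. The sketch below uses simulated data (the variable names `x1`, `x2`, `x3` are illustrative) and the `variance_inflation_factor` helper from statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=100)  # nearly a linear copy of x1
x3 = rng.normal(size=100)

# Include a constant so each VIF is measured against a model with an intercept
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)  # x1 and x2 get large VIFs; x3 stays near the minimum of 1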

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: MODEL INTERPRETATION

1. Fill in the blank: An interaction term represents how the relationship between two independent variables is associated with the changes in the _____ of the dependent variable.

  • category
  • multicollinearity
  • mean (CORRECT)
  • rate of change

Correct: An interaction term represents how the relationship between two independent variables is associated with the changes in the mean of the dependent variable. Typically, data professionals represent an interaction term as the product of the two independent variables in question.
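A minimal sketch of an interaction term, using simulated data and the statsmodels formula API (in a formula, `x1:x2` adds the product of the two independent variables):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
# Simulated outcome whose true model includes an interaction: the product x1 * x2
df["y"] = (2 * df["x1"] + 3 * df["x2"] + 1.5 * df["x1"] * df["x2"]
           + rng.normal(scale=0.5, size=200))

# x1:x2 in the formula is the interaction term
model = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(model.params)  # the x1:x2 coefficient recovers roughly 1.5
```

The fitted coefficient on `x1:x2` describes how the combined movement of the two independent variables shifts the mean of the dependent variable.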

2. Which of the following relevant statistics can be found by using the statsmodels OLS function? Select all that apply.

  • Variance inflation factors
  • Standard errors (CORRECT)
  • Coefficients (CORRECT)
  • P-values (CORRECT)

Correct: Coefficients, standard errors, p-values, and t-statistics can be found by using the statsmodels OLS function.

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: VARIABLE SELECTION AND MODEL EVALUATION

1. Fill in the blank: Adjusted R squared is a variation of the R squared regression evaluation metric that _____ unnecessary explanatory variables.

  • adds
  • eliminates
  • rewards
  • penalizes (CORRECT)

Correct: Adjusted R squared is a variation of the R squared regression evaluation metric that penalizes unnecessary explanatory variables. Unlike R squared, adjusted R squared can range from less than 0 up to 1.
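The penalty can be seen directly in the formula, sketched here with made-up values for R squared, sample size n, and number of predictors p:

```python
# Adjusted R squared from R squared, sample size n, and predictor count p:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
def adjusted_r_squared(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With R squared held fixed, raising p (adding unhelpful predictors)
# lowers the adjusted value
print(adjusted_r_squared(0.80, n=100, p=3))
print(adjusted_r_squared(0.80, n=100, p=10))
```

Holding R squared at 0.80, the 10-predictor model scores lower than the 3-predictor model, which is exactly the penalty for unnecessary explanatory variables.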

2. Which of the following statements accurately describe the differences between adjusted R squared and R squared? Select all that apply.

  • Adjusted R squared is more easily interpretable.
  • R squared is used to compare models of varying complexity.
  • Adjusted R squared is used to compare models of varying complexity. (CORRECT)
  • R squared is more easily interpretable. (CORRECT)

Correct: Adjusted R squared is used to compare models of varying complexity, while R squared is more easily interpretable: it determines how much variation in the dependent variable is explained by the model.

3. What variable selection process begins with the full model that has all possible independent variables?

  • Forward selection
  • Backward elimination (CORRECT)
  • Extra-sum-of-squares
  • F-test

Correct: The backward elimination variable selection process begins with the full model.

4. Which of the following are regularized regression techniques? Select all that apply.

  • F-test regression
  • Elastic-net regression (CORRECT)
  • Lasso regression (CORRECT)
  • Ridge regression (CORRECT)

Correct: Lasso regression, ridge regression, and elastic-net regression are regularized regression techniques.
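The difference between lasso and ridge can be sketched with scikit-learn on simulated data where only the first two of five features matter (the `alpha` value below is an arbitrary illustration, not a recommended setting):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
# Only the first two features actually drive the outcome
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(lasso.coef_)  # lasso drives irrelevant coefficients exactly to 0
print(ridge.coef_)  # ridge shrinks them toward 0 but keeps every variable
```

This is the distinction the quiz questions below return to: lasso completely removes less important variables, while ridge minimizes their impact but drops none of them. Elastic-net blends the two penalties.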

QUIZ: MODULE 3 CHALLENGE

1. A data team working for an online magazine uses a regression technique to learn about advertising sales in different sections of the publication. They estimate the linear relationship between one continuous dependent variable and four independent variables. What technique are they using?

  • Multiple linear regression (CORRECT)
  • Simple linear regression
  • Interaction regression
  • Coefficient regression

2. What technique turns one categorical variable into several binary variables?

  • Multiple linear regression
  • Overfitting
  • One hot encoding (CORRECT)
  • Adjusted R squared

3. Which of the following is true regarding variance inflation factors? Select all that apply.

  • The larger the variance inflation factor, the less multicollinearity in the model.
  • The minimum value is 0.
  • The larger the variance inflation factor, the more multicollinearity in the model. (CORRECT)
  • The minimum value is 1. (CORRECT)

4. What term represents how the relationship between two independent variables is associated with changes in the mean of the dependent variable?

  • Normality term
  • Selection term
  • Interaction term (CORRECT)
  • Coefficient term

5. Which of the following statements accurately describe adjusted R squared? Select all that apply.

  • It is greater than 1.
  • It is a regression evaluation metric. (CORRECT)
  • It can vary from 0 to 1. (CORRECT)
  • It penalizes unnecessary explanatory variables. (CORRECT)

6. Which of the following statements accurately describe forward selection and backward elimination? Select all that apply.

  • Forward selection begins with the full model with all possible independent variables.
  • Forward selection begins with the full model with all possible dependent variables.
  • Forward selection begins with the null model and zero independent variables. (CORRECT)
  • Backward elimination begins with the full model with all possible independent variables. (CORRECT)

7. A data professional reviews model predictions for a human resources project. They discover that the model performs poorly on both the training data and the test holdout data, consistently predicting figures that are too low. This leads to inaccurate estimates about employee retention. What quality does this model have too much of?

  • Bias (CORRECT)
  • Entropy
  • Variance
  • Leakage

8. What regularization technique completely removes variables that are less important to predicting the y variable of interest?

  • Elastic net regression
  • Independent regression
  • Lasso regression (CORRECT)
  • Ridge regression

9. A data team with a restaurant group uses a regression technique to learn about customer loyalty and ratings. They estimate the linear relationship between one continuous dependent variable and two independent variables. What technique are they using?

  • Coefficient regression
  • Simple linear regression
  • Interaction regression
  • Multiple linear regression (CORRECT)

10. A data professional confirms that no two independent variables are highly correlated with each other. Which assumption are they testing for?

  • No multicollinearity assumption (CORRECT)
  • No linearity assumption
  • No normality assumption
  • No homoscedasticity assumption

11. What term represents the relationship for how two variables’ values affect each other?

  • Underfitting term
  • Linearity term
  • Interaction term (CORRECT)
  • Feature selection term

12. Which regression evaluation metric penalizes unnecessary explanatory variables?

  • Holdout sampling
  • Adjusted R squared (CORRECT)
  • Overfitting
  • Regression sampling

13. A data professional tells you that their model fails to adequately capture the relationship between the target variable and independent variables because it has too much bias. What is the most likely cause of the bias?

  • Underfitting (CORRECT)
  • Overfitting
  • Leakage
  • Entropy

14. What regularization technique minimizes the impact of less relevant variables, but drops none of the variables from the equation?

  • Lasso regression
  • Forward regression
  • Elastic net regression
  • Ridge regression (CORRECT)

15. Fill in the blank: The no multicollinearity assumption states that no two _____ variables can be highly correlated with each other.

  • dependent
  • categorical
  • independent (CORRECT)
  • continuous

16. Fill in the blank: An interaction term represents how the relationship between two independent variables is associated with changes in the _____ of the dependent variable.

  • category
  • multicollinearity
  • assumption
  • mean (CORRECT)

17. A data professional uses an evaluation metric that penalizes unnecessary explanatory variables. Which metric are they using?

  • Link function
  • Adjusted R squared (CORRECT)
  • Ordinary least squares
  • Holdout sampling

18. What stepwise variable selection process begins with the full model with all possible independent variables?

  • Forward selection
  • Backward elimination (CORRECT)
  • Extra-sum-of-squares F-test
  • Overfit selection

19. A data analytics team creates a model for a project supporting their company’s sales department. The model performs very well on the training data, but it scores much worse when used to predict on new, unseen data. What does this model have too much of?

  • Entropy
  • Bias
  • Leakage
  • Variance (CORRECT)

20. A data professional at a car rental agency uses a regression technique to learn about how customers engage with various sections of the company website. They estimate the linear relationship between one continuous dependent variable and three independent variables. What technique are they using?

  • One hot encoding
  • Multiple linear regression (CORRECT)
  • Simple linear regression
  • Interaction terms

21. Which of the following are examples of categorical variables? Select all that apply.

  • Shirt inventory
  • Shirt country of manufacture (CORRECT)
  • Shirt type (CORRECT)
  • Shirt size (CORRECT)

22. Fill in the blank: One hot encoding is a data transformation technique that turns one categorical variable into several _____ variables.

  • independent
  • dependent
  • overfit
  • binary (CORRECT)

23. What stepwise variable selection process begins with the null model and zero independent variables?

  • Backward elimination
  • Holdout elimination
  • Forward selection (CORRECT)
  • Extra-sum-of-squares F-test

24. What data transformation technique turns one categorical variable into several binary variables?

  • Label encoding
  • Multiple regression
  • One hot encoding (CORRECT)
  • Adjusted R squared

Correct: One hot encoding is a data transformation technique that turns one categorical variable into several binary variables. Data professionals use one hot encoding when there is a categorical independent variable and they need to represent the category as numbers.

25. Fill in the blank: The _____ states that no two independent variables (Xi and Xj) can be highly correlated with each other.

  • no linearity assumption
  • no homoscedasticity assumption
  • no normality assumption
  • no multicollinearity assumption (CORRECT)

Correct: The no multicollinearity assumption states that no two independent variables (Xi and Xj) can be highly correlated with each other. This means that Xi and Xj cannot be linearly related to each other.

26. Fill in the blank: An interaction term represents the relationship between two independent variables and the change in the mean of the _____ variable.

  • global
  • instance
  • dependent (CORRECT)
  • independent

Correct: An interaction term represents the relationship between two independent variables and the change in the mean of the dependent variable. Typically, it is the product of the two independent variables.

27. What is the process of determining which variables or features to include in a given model?

  • Backward elimination
  • Extra-sum-of-squares F-test
  • Forward selection
  • Variable selection (CORRECT)

Correct: Variable selection is the process of determining which variables or features to include in a given model. Variable selection is iterative.

CONCLUSION – Multiple Linear Regression

In conclusion, this section serves as a pivotal juncture in the Google Advanced Data Analytics Certificate, guiding participants from the fundamentals of simple regression to the intricacies of multiple linear regression. As participants delve into the complexities of incorporating multiple variables into their models, they gain a comprehensive understanding of how to navigate the nuanced world of regression analysis. The anticipatory exploration of machine learning concepts further equips learners with the knowledge and skills needed to venture into the broader landscape of advanced analytics.

By combining theoretical insights with practical applications, participants not only solidify their grasp of regression modeling but also prepare themselves for the dynamic challenges presented by real-world data scenarios. This concluding reflection underscores the importance of this section as an indispensable component in the overall journey toward becoming proficient data professionals, capable of harnessing the power of regression analysis and machine learning to derive valuable insights from diverse datasets.