ICPSR Summer Program Teaching

Regression III

The Regression III course takes a considerably different form from the first two regression courses at the Summer Program. It is meant to prepare you for the issues you will encounter when you (attempt to) publish quantitative work with linear models, and with more complicated models, too.

Initial linear model classes focus on the assumptions and theoretical considerations of linear models and generally walk you through estimation and interpretation. Good courses also deal with diagnostics, though these often get less time than they should. Further, it is not always obvious what violating these assumptions means in practical terms.

This course will provide you with a systematic approach to assessing, fixing, and presenting your linear model results. Though we focus almost exclusively on the linear model (we will allude to nonlinear models occasionally), the logic we follow will be helpful in dealing with nonlinear models as well. More details can be found in the syllabus.

Dave’s Office Hours: TBD

TAs

1 Introduction

In this lecture, we discuss the goals of the course and walk through some of the tools covered. This lecture is really to give you a sense of what you'll learn in the next four weeks so you can make an informed decision about whether this workshop is for you. We're going to use the chat in Gitter (rather than the one native to Zoom). To use the chat, you'll need a GitHub, GitLab, or Twitter account to log in.

2 Effective Model Presentation I

This lecture covers some of the things we often gloss over when presenting linear model results. We discuss novel solutions to the reference category problem for categorical variables operationalized with dummy regressors, and we consider solutions to the multiplicity problem in hypothesis testing. For students in the Regression III workshop, the solutions to the in-class exercises are available on UM's Canvas page for the course.
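To give a concrete flavor of both problems, here is a minimal sketch (not the lecture's own materials) of one standard remedy: estimating all pairwise differences among a factor's levels, with a multiplicity adjustment, via the multcomp package. The data and variable names are simulated for illustration.

```r
## Simulated four-group data; 'group' and 'y' are invented for illustration.
library(multcomp)

set.seed(123)
d <- data.frame(group = factor(rep(c("A", "B", "C", "D"), each = 50)))
d$y <- c(0, 0.5, 0.6, 1.2)[d$group] + rnorm(200)

mod <- lm(y ~ group, data = d)

## Tukey contrasts estimate every pairwise difference, not just the
## differences from the (arbitrary) reference category.
pw <- glht(mod, linfct = mcp(group = "Tukey"))

## summary() applies a multiplicity adjustment to the pairwise tests
## by default, addressing the multiple-testing problem.
summary(pw)
```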

3 Effective Model Presentation II

This lecture covers interaction effects. We discuss interactions in three scenarios: two categorical variables, two continuous variables, and one of each. For each scenario, we discuss how to figure out whether an interaction exists and, if so, how to understand what the interaction effect looks like. We also discuss centering and interactions, noting that centering doesn't solve a statistical problem, but it can solve a problem with interpretation.
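As a quick illustration of the centering point, here is a minimal sketch with simulated data: mean-centering the constituent terms changes the lower-order coefficients (and their interpretation) but leaves the fit, and the interaction coefficient, untouched.

```r
## Simulated data; all names are invented for illustration.
set.seed(42)
d <- data.frame(x = rnorm(200), z = rnorm(200))
d$y <- 1 + 0.5 * d$x + 0.25 * d$z + 0.4 * d$x * d$z + rnorm(200)

m_raw <- lm(y ~ x * z, data = d)

## Refit with mean-centered constituent variables.
d$xc <- d$x - mean(d$x)
d$zc <- d$z - mean(d$z)
m_ctr <- lm(y ~ xc * zc, data = d)

## Identical fit and identical interaction coefficient ...
all.equal(fitted(m_raw), fitted(m_ctr))
c(coef(m_raw)["x:z"], coef(m_ctr)["xc:zc"])

## ... but the lower-order terms now give each variable's effect at the
## mean of the other, which is often the more interpretable quantity.
c(coef(m_raw)["x"], coef(m_ctr)["xc"])
```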

4 Linearity I

This lecture starts our discussion of linearity and diagnosing un-modeled non-linearity. Here we talk about what to do with ordinal variables on both sides of the regression equation. What do we do when variables could be considered either categorical or continuous (e.g., Polity's democracy variable or seven-point indicators of party ID)? We discuss ways of testing assumptions about level of measurement and understanding when it is appropriate to use variables with few categories as though they were quantitative.
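One simple version of such a test, sketched below with simulated data (the seven-point variable `pid7` is invented for illustration): the linear specification is nested in the fully dummy-coded one, so an F-test can assess whether treating the variable as quantitative is tenable.

```r
## Simulated seven-category predictor; 'pid7' stands in for a
## seven-point party-identification scale.
set.seed(1)
d <- data.frame(pid7 = sample(1:7, 500, replace = TRUE))
d$y <- 0.3 * d$pid7 + rnorm(500)

m_lin <- lm(y ~ pid7, data = d)          # treated as quantitative
m_cat <- lm(y ~ factor(pid7), data = d)  # treated as categorical

## The linear model is nested in the categorical one; a non-significant
## F-statistic suggests the linearity restriction is reasonable.
anova(m_lin, m_cat)
```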

5 Linearity II

This lecture goes deeper into diagnosing problems of un-modeled non-linearity. We talk about component-plus-residual plots along with local polynomial regression for diagnosing problems, as well as transformations and polynomials for fixing different kinds of non-linearity. These are pretty conventional tools for diagnosing and solving uncomplicated functional form problems. In later lectures we move to more automated tools for modeling arbitrarily complicated relationships.
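For instance, here is a minimal sketch using simulated data and the car package: component-plus-residual plots flag curvature in a predictor that actually enters quadratically, and a polynomial term is one conventional fix.

```r
## Simulated data where x1 enters quadratically.
library(car)

set.seed(7)
d <- data.frame(x1 = runif(300, -2, 2), x2 = rnorm(300))
d$y <- d$x1^2 + 0.5 * d$x2 + rnorm(300)

m <- lm(y ~ x1 + x2, data = d)

## Each panel plots partial residuals against one predictor with a
## smooth overlaid; curvature in the x1 panel flags the problem.
crPlots(m)

## One conventional fix: a quadratic in x1.
m2 <- lm(y ~ poly(x1, 2) + x2, data = d)
```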

6 Relative Importance

This short lecture discusses how we assess the differential impact variables have on the dependent variable. Standardized variables and relative importance measures can both help us compare the sizes of effects.
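A minimal sketch of the standardization idea, with simulated data: once everything is scaled to standard-deviation units, the coefficients are at least on a common metric.

```r
## Simulated predictors on very different scales.
set.seed(11)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300, sd = 5))
d$y <- 0.5 * d$x1 + 0.2 * d$x2 + rnorm(300)

## Raw coefficients are not comparable across differently scaled inputs ...
coef(lm(y ~ x1 + x2, data = d))

## ... standardized coefficients are on a common metric: a one-SD change
## in x yields a beta-SD change in y.
coef(lm(scale(y) ~ scale(x1) + scale(x2), data = d))
```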

7 Bootstrapping

This lecture introduces the idea of bootstrapping for generating sampling distributions. This is useful for quantities with unknown sampling distributions or those where distributional assumptions are dubious. We focus mostly on bootstrapping the regression model, but work through an exercise of using the bootstrap to derive confidence intervals for local polynomial regression.
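Here is a minimal sketch of the case (row-resampling) bootstrap for a regression slope, using the boot package on simulated heavy-tailed data; the percentile interval avoids leaning on a normality assumption.

```r
## Simulated data with heavy-tailed errors, where normal-theory
## intervals are more suspect.
library(boot)

set.seed(99)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 0.5 * d$x + rt(200, df = 3)

## Statistic: refit the model on a resampled set of rows.
boot_coef <- function(data, idx) {
  coef(lm(y ~ x, data = data[idx, ]))
}

b <- boot(d, boot_coef, R = 2000)

## Percentile confidence interval for the slope (the statistic's
## second element).
boot.ci(b, type = "perc", index = 2)
```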

8 Model Comparison and Discrimination

This lecture discusses several methods we can use to discriminate between models, including a theoretical discussion of information criteria. We also discuss the Clarke test for non-nested models and talk about ways to extend the test with the clarkeTest package in R. In addition, we discuss model selection uncertainty and multi-model averaging.
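As a small base-R illustration of the information-criteria portion, here is a sketch comparing two non-nested models of simulated data; the clarkeTest package provides a distribution-free alternative for the same task.

```r
## Two non-nested candidate models for simulated data.
set.seed(5)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- 0.6 * d$x1 + 0.2 * d$x2 + rnorm(300)

m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x2, data = d)

## Lower values indicate better expected out-of-sample fit; AIC and BIC
## penalize model complexity differently.
AIC(m1, m2)
BIC(m1, m2)
```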

9 Feature Selection and Regularization

This lecture takes a slightly different look at model selection. Rather than treating model testing as a purely theoretical enterprise, we discuss some exhaustive ways of searching the model space (given a set of candidate variables). Here, we discuss all-subsets regression as well as regularization methods: ridge regression, the LASSO, elastic nets, and the adaptive LASSO. In particular, we focus on how these models respond to situations of high collinearity.
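A minimal sketch of the collinearity point with glmnet and simulated data: ridge (alpha = 0) shrinks a nearly collinear pair of predictors toward a shared value, while the LASSO (alpha = 1) tends to keep one and zero out the other.

```r
## Simulated design with two nearly collinear predictors.
library(glmnet)

set.seed(21)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)

## Cross-validation chooses the penalty; alpha = 0 is ridge, 1 is LASSO.
cv_ridge <- cv.glmnet(X, y, alpha = 0)
cv_lasso <- cv.glmnet(X, y, alpha = 1)

## Compare how the two penalties treat the collinear pair.
coef(cv_ridge, s = "lambda.min")
coef(cv_lasso, s = "lambda.min")
```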

10 Splines

Sometimes the parametric form of a non-linearity is unknown. Splines offer us a way of modeling non-linearities that is generally more flexible than polynomials but has smaller sampling variability than local polynomial regression. Splines also allow us to test the adequacy of parametric non-linear models in the OLS/GLM context. This lecture discusses truncated power basis functions and B-splines for modeling non-linear relationships.
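A minimal sketch with simulated data and the splines package: fit a B-spline, then test a parametric alternative against it. (Global polynomials of degree three or less lie inside the cubic-spline space, so the quadratic model below is nested in the spline model and the F-test is legitimate.)

```r
## Simulated non-linear relationship.
library(splines)

set.seed(31)
d <- data.frame(x = runif(300, 0, 10))
d$y <- sin(d$x) + rnorm(300, sd = 0.5)

m_quad   <- lm(y ~ poly(x, 2), data = d)    # parametric candidate
m_spline <- lm(y ~ bs(x, df = 8), data = d) # cubic B-spline

## The quadratic is nested in the cubic-spline space, so the F-test
## asks whether the spline's extra flexibility is needed.
anova(m_quad, m_spline)
```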

11 Penalized Splines and GAMs

We use the generalized additive models for location, scale and shape (GAMLSS) framework to talk about penalized splines, which have some benefits compared to unpenalized regression splines (the subject of lecture 10). Here, we talk about how penalized splines work, how we can estimate monotonic relationships, and how to compare across models. We also revisit interactions, focusing on some recent research about linearity and non-linearity in interactions.
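A minimal sketch, assuming the gamlss package's pb() penalized B-spline smoother, fit to simulated data; the penalty lets the data choose the effective smoothness rather than forcing a knot-selection decision.

```r
## Simulated smooth relationship; names invented for illustration.
library(gamlss)

set.seed(41)
d <- data.frame(x = runif(400, 0, 10))
d$y <- sin(d$x) + rnorm(400, sd = 0.4)

## Normal (NO) response; pb() is gamlss's penalized B-spline smoother,
## with the smoothing parameter estimated automatically.
m <- gamlss(y ~ pb(x), family = NO, data = d)

## Plot the estimated smooth term for the mean (mu) model.
term.plot(m, what = "mu")
```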

12 Model Diagnostics

In this lecture, we discuss conventional model diagnostics and their relationship to diagnostics we can leverage in the GAMLSS framework. Here, we will discuss outliers and robust regression, heteroskedasticity, and the higher-moment modeling that GAMLSS makes easy, as well as the unified model diagnostics the framework provides.
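Two of the conventional checks, sketched on simulated heteroskedastic data: a Breusch-Pagan test from the lmtest package and an outlier-resistant M-estimator from MASS.

```r
## Simulated data with error variance that grows with x.
library(lmtest)
library(MASS)

set.seed(51)
d <- data.frame(x = runif(300, 0, 5))
d$y <- 1 + 0.5 * d$x + rnorm(300, sd = 0.3 + 0.3 * d$x)

m <- lm(y ~ x, data = d)

## Breusch-Pagan test: a small p-value flags non-constant error variance.
bptest(m)

## Robust (M-estimation) regression downweights outlying observations.
m_rob <- rlm(y ~ x, data = d)
summary(m_rob)
```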

13 Flexible Model Fitting

Here, we discuss other models that do not require strict functional form assumptions. These include Multivariate Adaptive Regression Splines (MARS), regularized polynomial expansions, and tree-based regression models (CART). We also talk about incorporating some of these features into the GAMLSS framework.
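A minimal sketch of two of these methods on the same simulated data, assuming the earth (MARS) and rpart (CART) packages; neither requires specifying the functional form in advance.

```r
## Simulated data with a hinge in x1 and a smooth wave in x2.
library(earth)
library(rpart)

set.seed(61)
d <- data.frame(x1 = runif(400, -3, 3), x2 = runif(400, -3, 3))
d$y <- pmax(0, d$x1) + sin(d$x2) + rnorm(400, sd = 0.3)

## MARS builds the fit from adaptively chosen hinge functions.
m_mars <- earth(y ~ x1 + x2, data = d)
summary(m_mars)

## CART partitions the predictor space into piecewise-constant regions.
m_tree <- rpart(y ~ x1 + x2, data = d)
printcp(m_tree)
```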

Software that Makes the Class Work

From time to time, people ask about the software I use in the course. Here is a brief discussion of the software I use in my teaching workflow.

  • The website that serves all of this content is hosted by my friend Tof at asocialfolder.
  • First and foremost, the course works because R and the user-developed packages that we use are all open-source.
  • I use the xaringan package to make slides in RMarkdown that are based on the remark.js framework. I also use the xaringanthemer package and a bit of custom CSS to style the slides.
  • I have used livecode to pretty good effect in classes. See the five-minute rstudio::conf talk about livecode in R here. I use ngrok to make files served on my local private network available to the public along these lines.
  • As a backup for people experiencing technical issues, I use RStudio.cloud to ensure people have access to the software we use in the course.
  • I use the scribble add-in from xaringanExtra and occasionally Demo Pro for live annotation of slides and the screen.