ICPSR Summer Program Teaching

Regression III

The Regression III course takes a considerably different form from the first two regression courses in the Summer Program. This course should prepare you for what you will encounter when you (attempt to) publish quantitative work with linear models, and more complicated ones, too.

Initial linear model courses focus on the assumptions and theoretical underpinnings of linear models and generally walk you through estimation and interpretation. Good courses also deal with diagnostics, though these often get less time than they deserve. Further, it is not always obvious what the practical consequences of violating these assumptions will be.

This course will provide you with a systematic approach to assessing, fixing and presenting your linear model results. Though we focus almost exclusively on the linear model (we will allude to nonlinear models occasionally), the logic we follow will be helpful in dealing with nonlinear models as well. More details can be found in the syllabus.

Dave’s Office Hours: TBD

TAs

1 Introduction

In this lecture, we discuss the goals of the course and walk through some of the tools covered. This lecture is really meant to give you a sense of what you’ll learn in the next four weeks so you can make an informed decision about whether this workshop is for you. We’re going to use the chat in Gitter (rather than the one native to Zoom). To use the chat, you’ll need an account with GitHub, GitLab or Twitter to log in.

2 Effective Model Presentation I

This lecture covers some of the things we often gloss over when presenting linear model results. We discuss novel solutions to the reference category problem for categorical variables operationalized with dummy regressors, consider solutions to the multiplicity problem in hypothesis testing, and cover several methods researchers can use to compare effect sizes across variables. For students in the Regression III workshop, the solutions to the in-class exercises are available on UM’s Canvas page for the course. A small R sketch of the multiplicity idea follows the materials below.

  • Slides pdf
  • Code r
  • Exercises rmd
  • livecode file r
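
To give a flavor of the multiplicity material, here is a minimal sketch (not the course code) of adjusting a set of coefficient p-values for multiple testing with base R’s p.adjust(); the data set and model below are placeholders for illustration.

    ## A hypothetical model; mtcars stands in for your own data
    mod <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
    p_raw <- summary(mod)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
    cbind(raw  = p_raw,
          holm = p.adjust(p_raw, method = "holm"),      # familywise error control
          bh   = p.adjust(p_raw, method = "BH"))        # false discovery rate control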

3 Effective Model Presentation II

This lecture covers interaction effects. We discuss interactions in three scenarios - two categorical variables, two continuous variables and one of each. For each scenario, we discuss how to figure out whether an interaction exists and, if so, how to understand what the interaction effect looks like. We also discuss centering and interactions, noting that centering doesn’t solve a statistical problem, but it can solve a problem of interpretation. A small R sketch of centering in an interaction model follows the materials below.

  • Slides pdf
  • Code r
  • Exercises rmd
  • livecode file r
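
A minimal sketch (not the course code) of a continuous-by-continuous interaction with and without mean-centering; mtcars and its variables are placeholders for illustration.

    m1 <- lm(mpg ~ wt * hp, data = mtcars)         # raw interaction
    mtcars$wt_c <- mtcars$wt - mean(mtcars$wt)     # center each term
    mtcars$hp_c <- mtcars$hp - mean(mtcars$hp)
    m2 <- lm(mpg ~ wt_c * hp_c, data = mtcars)
    ## The interaction coefficient is identical; centering only changes what the
    ## lower-order terms mean (the effect of each variable at the mean of the other).
    coef(m1)["wt:hp"]
    coef(m2)["wt_c:hp_c"]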

Homework 1

This homework lets you try out the tools you learned to deal with interactions. You can download the file below or access it via the Canvas site.

4 Linearity I

This lecture starts our discussion of linearity and diagnosing un-modeled non-linearity. Here we talk about what to do with ordinal variables on both sides of the regression equation. What do we do when variables could be treated as either categorical or continuous (e.g., Polity’s democracy variable or seven-point indicators of party identification)? We discuss ways of testing assumptions about level of measurement and understanding when it is appropriate to use variables with few categories as though they were quantitative. A small R sketch of one such test follows the materials below.

  • Slides pdf
  • Code r
  • Exercises rmd
  • livecode file r
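
A minimal sketch (not the course code) of one way to check whether a few-category ordinal predictor can be treated as quantitative: fit it both ways and compare the nested models. Here cyl in mtcars is just a stand-in for something like a party-identification scale.

    m_num <- lm(mpg ~ cyl, data = mtcars)           # treated as continuous (linear)
    m_fac <- lm(mpg ~ factor(cyl), data = mtcars)   # treated as categorical (dummies)
    anova(m_num, m_fac)   # a small p-value suggests the linear treatment is too restrictive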

5 Linearity II

This lecture goes deeper into diagnosing problems of un-modeled non-linearity. We talk about component-plus-residual plots along with local polynomial regression for diagnosing problems, as well as transformations and polynomials for fixing different kinds of non-linearity. These are pretty conventional tools for diagnosing and solving uncomplicated functional form problems. In later lectures we move to more automated tools for modeling arbitrarily complicated relationships. A small R sketch of these diagnostics follows the materials below.

  • Slides pdf
  • Code r
  • Exercises rmd
  • livecode file r
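
A minimal sketch (not the course code) of two of these tools: a component-plus-residual plot for diagnosis and a polynomial term for repair. It assumes the car package is installed; the data and variables are placeholders.

    library(car)
    m_lin  <- lm(mpg ~ hp + wt, data = mtcars)
    crPlots(m_lin)                                       # look for curvature around the fitted line
    m_poly <- lm(mpg ~ poly(hp, 2) + wt, data = mtcars)  # quadratic in hp
    anova(m_lin, m_poly)                                 # does the extra term help?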

6 Bootstrapping

This lecture introduces the idea of bootstrapping for generating sampling distributions. This is useful for quantities with unknown sampling distributions or those where distributional assumptions are dubious. We focus mostly on bootstrapping the regression model, but also work through an exercise that uses the bootstrap to derive confidence intervals for local polynomial regression. A small R sketch of a case bootstrap for regression coefficients follows the materials below.

  • Slides pdf
  • Code r
  • Exercises rmd
  • livecode file r
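
A minimal sketch (not the course code) of a nonparametric (case) bootstrap for regression coefficients using the boot package; the data and model are placeholders.

    library(boot)
    boot_coefs <- function(dat, idx) {
      coef(lm(mpg ~ wt + hp, data = dat[idx, ]))   # refit on the resampled rows
    }
    set.seed(1234)
    bs <- boot(mtcars, boot_coefs, R = 1000)
    boot.ci(bs, type = "perc", index = 2)          # percentile CI for the wt coefficient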

7 Model Comparison and Discrimination

This lecture discusses several methods we can use to discriminate between models, including a theoretical discussion of information criteria. We also discuss the Clarke test for non-nested models and talk about ways to extend the test with the clarkeTest package in R. In addition, we discuss model selection uncertainty and multi-model averaging.
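
A minimal sketch (not the course code) of comparing two candidate specifications with information criteria; the Clarke test itself is handled with the clarkeTest package in the lecture materials. The models below are placeholders.

    m_a <- lm(mpg ~ wt + hp,   data = mtcars)
    m_b <- lm(mpg ~ wt + qsec, data = mtcars)
    AIC(m_a, m_b)   # smaller is better
    BIC(m_a, m_b)   # heavier penalty on the number of parameters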

8 Feature Selection and Regularization

This lecture takes a slightly different look at model selection. Rather than thinking about model testing as a theoretical enterprise, we discuss some exhaustive ways of searching the model space (given a set of candidate variables). Here, we discuss all subsets regression as well as regularization methods - ridge regression, LASSO, elastic nets and the adaptive LASSO. In particular, we focus on how these models respond to situations of high collinearity.
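
A minimal sketch (not the course code) of ridge and LASSO fits with the glmnet package (assumed installed); the data are placeholders.

    library(glmnet)
    X <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # predictor matrix without the intercept column
    y <- mtcars$mpg
    set.seed(1234)
    ridge <- cv.glmnet(X, y, alpha = 0)   # alpha = 0: ridge penalty
    lasso <- cv.glmnet(X, y, alpha = 1)   # alpha = 1: LASSO penalty
    coef(lasso, s = "lambda.min")         # coefficients at the cross-validated penalty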

9 Splines

Sometimes the parametric form of non-linearity is unknown. Splines offer us a way of modeling non-linearities that is generally more flexible than polynomials, but has smaller sampling variability than local polynomial regression. Splines also allow us to test the adequacy of parametric non-linear models in the OLS/GLM context. This lecture discusses truncated power basis functions and B-splines for considering non-linear relationships.
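
A minimal sketch (not the course code) of a B-spline term in an OLS model using the base splines package; the data and variables are placeholders.

    library(splines)
    m_lin <- lm(mpg ~ hp + wt, data = mtcars)
    m_bs  <- lm(mpg ~ bs(hp, df = 4) + wt, data = mtcars)   # cubic B-spline in hp
    anova(m_lin, m_bs)   # test the linear fit against the spline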

10 Penalized Splines and GAMs

We use the generalized additive models for location, scale and shape (GAMLSS) framework to talk about penalized splines, which have some benefits compared to unpenalized regression splines (the subject of lecture 9). Here, we talk about how penalized splines work, how we can estimate monotonic relationships and how to compare across models. We also revisit interactions, focusing on some recent research about linearity and non-linearity in interactions. A small R sketch of a penalized-spline fit follows the materials below.

  • Slides pdf
  • Code r
  • Exercises rmd
  • livecode html
  • Smoothing Spline Simulation html
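
A minimal sketch (not the course code) of a penalized-spline fit. The lecture works in the GAMLSS framework; here mgcv’s gam() stands in to show the basic idea, and the data are placeholders.

    library(mgcv)
    m_ps <- gam(mpg ~ s(hp) + wt, data = mtcars)   # penalized spline in hp
    summary(m_ps)   # effective degrees of freedom indicate how wiggly the fit is
    plot(m_ps)      # the estimated smooth term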

11 Model Diagnostics

In this lecture, we discuss conventional model diagnostics and their relationship to the diagnostics we can leverage in the GAMLSS framework. Here, we discuss outliers and robust regression, heteroskedasticity, and the higher-moment modeling that GAMLSS makes easy, as well as the unified model diagnostics available in that framework.
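
A minimal sketch (not the course code) of two conventional diagnostics mentioned here: a heteroskedasticity test and a robust-regression refit. It assumes the lmtest and MASS packages are installed; the data are placeholders.

    library(lmtest)
    library(MASS)
    m <- lm(mpg ~ wt + hp, data = mtcars)
    bptest(m)                                    # Breusch-Pagan test for non-constant variance
    m_rob <- rlm(mpg ~ wt + hp, data = mtcars)   # M-estimation downweights outlying observations
    summary(m_rob)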

12 Flexible Model Fitting

Here, we discuss other models that do not require strict functional form assumptions. These include Multivariate Adaptive Regression Splines (MARS), regularized polynomial expansions, and tree-based regression models (CART). We also talk about incorporating some of these features into the GAMLSS framework.
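
A minimal sketch (not the course code) of MARS and CART fits, assuming the earth and rpart packages are installed; the data are placeholders.

    library(earth)
    library(rpart)
    m_mars <- earth(mpg ~ ., data = mtcars)   # MARS: piecewise-linear hinge functions
    m_cart <- rpart(mpg ~ ., data = mtcars)   # CART: recursive binary splits
    summary(m_mars)
    printcp(m_cart)                           # complexity table used for pruning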

13 Multiple Imputation

Missing data is a ubiquitous problem in social science data analysis. We discuss the problems missing data cause and multiple imputation as a way to characterize the uncertainty it introduces into our models. We also discuss sensitivity analysis for some of the assumptions that determine whether multiple imputation is a suitable solution to these problems.
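
A minimal sketch (not the course code) of the impute-analyze-pool workflow with the mice package (assumed installed), using the small nhanes example data shipped with it.

    library(mice)
    imp  <- mice(nhanes, m = 5, seed = 1234, printFlag = FALSE)   # 5 imputed data sets
    fits <- with(imp, lm(chl ~ age + bmi))                        # fit the model in each
    summary(pool(fits))                                           # combine with Rubin’s rules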

14 Finite Mixtures

For much of the second half of the course, we’ve talked about model testing and feature selection - identifying which features should be in the model and in what form. Finite mixture models are, in essence, a form of observation selection - identifying which observations are best explained by each member of a small, fixed set of models.
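
A minimal sketch (not the course code) of a two-component finite mixture of regressions with the flexmix package (assumed installed); the data and variables are placeholders.

    library(flexmix)
    set.seed(1234)
    fm <- flexmix(mpg ~ wt, data = mtcars, k = 2)   # two latent regression regimes
    parameters(fm)   # component-specific coefficients
    clusters(fm)     # the component each observation is assigned to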

Other stuff

We don’t have time to talk about all of the interesting regression stuff in the class, so here are some slides on other topics that may be of interest:

Software that Makes the Class Work

From time to time, people ask about the software I use in the course. Here is a brief discussion of the software I use in my teaching workflow.

  • The website you see that serves all of this content is hosted by my friend Tof at asocialfolder.
  • First and foremost, the course works because R and the user-developed packages that we use are all open-source.
  • I use the xaringan package to make slides in RMarkdown that are based on the remark.js framework. I also use the xaringanthemer package and a bit of custom CSS to style the slides.
  • The slides get printed from html to PDF using the decktape.js library.
  • I have used livecode to pretty good effect in classes. See the rstudio::conf 5 minute talk about livecode in R here. I use ngrok to serve my local private network files to the public along the lines of this post.
  • As a backup for people experiencing technical issues, I use RStudio.cloud to ensure people have access to the software we use in the course.
  • I use Doceri for live annotation of slides.