The Regression III course takes a considerably different form from the first two regression courses at the Summer Program. It aims to prepare you for what you will encounter when you (attempt to) publish quantitative work with linear models, and with more complicated ones, too.
Initial linear model classes focus on the assumptions and theoretical considerations of linear models and generally walk you through estimation and interpretation. Good courses also deal with diagnostics, though these often get less time than they should. Further, it is not always obvious what violations of these assumptions will lead to in practical terms.
This course will provide you with a systematic approach to assessing, fixing and presenting your linear model results. Though we focus almost exclusively on the linear model (we will allude to nonlinear models occasionally), the logic we follow will be helpful in dealing with nonlinear models as well. More details can be found in the syllabus.
Dave’s Office Hours: TBD
TAs
- Chris Schwarz (NYU, Political Science)
  Office Hours: TBD
- Kathryn Overton (University of New Mexico, Political Science)
  Office Hours: TBD
1 Introduction
In this lecture, we discuss the goals of the course and walk through some of the tools covered. The lecture is meant to give you a sense of what you’ll learn in the next four weeks so you can make an informed decision about whether this workshop is for you. We’re going to use the chat in Gitter (rather than the one native to Zoom). To use the chat, you’ll need a GitHub, GitLab or Twitter account to log in.
- Slides html, PDF 1 per page, PDF 4 per page
2 Effective Model Presentation I
This lecture covers some of the things we often gloss over when presenting linear model results. We discuss novel solutions to the reference category problem for categorical variables operationalized with dummy regressors, and we consider solutions for the multiplicity problem in hypothesis testing. For students in the Regression III workshop, the solutions to the in-class exercises are available on UM’s Canvas page for the course.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
- Exercises rmd
- livecode html
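As a small illustration of the multiplicity problem mentioned in this lecture's description, the sketch below applies Holm's step-down adjustment to a set of p-values. Holm's method is an assumption, chosen here because it is a common correction; the lecture may present different solutions. The course's own code is in R, so this Python version, the function name `holm_adjust` and the example p-values are purely illustrative.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Visit the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up p-values for three hypothetical coefficient tests.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
```

Each adjusted value is at least as large as the raw p-value, which is the price paid for controlling the family-wise error rate across all of the tests.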
3 Effective Model Presentation II
This lecture covers interaction effects. We discuss interactions in three scenarios: two categorical variables, two continuous variables, and one of each. For each scenario, we discuss how to figure out whether an interaction exists and, if so, how to understand what the interaction effect looks like. We also discuss centering and interactions, noting that centering doesn’t solve a statistical problem, but it could solve a problem with interpretation.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
- Exercises rmd
- livecode html
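The point about centering in this lecture's description can be checked with a little algebra. The sketch below takes a hypothetical model with two continuous variables, y = b0 + b1·x + b2·z + b3·x·z, and rewrites it in terms of mean-centered x and z. The coefficient values and means are made up for illustration, and the Python code is only a sketch (the course's own materials are in R).

```python
def predict(coefs, x, z):
    """Prediction from y = b0 + b1*x + b2*z + b3*x*z."""
    b0, b1, b2, b3 = coefs
    return b0 + b1 * x + b2 * z + b3 * x * z


def recenter(coefs, x_mean, z_mean):
    """Coefficients of the same model written in centered x and z."""
    b0, b1, b2, b3 = coefs
    return (
        b0 + b1 * x_mean + b2 * z_mean + b3 * x_mean * z_mean,
        b1 + b3 * z_mean,  # effect of x when z is at its mean
        b2 + b3 * x_mean,  # effect of z when x is at its mean
        b3,                # the interaction coefficient is unchanged
    )


# Hypothetical coefficients and sample means.
raw_coefs = (1.0, 0.5, -2.0, 0.25)
x_mean, z_mean = 10.0, 4.0
centered_coefs = recenter(raw_coefs, x_mean, z_mean)

# The two parameterizations give the same prediction at the same data point.
same_point_raw = predict(raw_coefs, 12.0, 3.0)
same_point_centered = predict(centered_coefs, 12.0 - x_mean, 3.0 - z_mean)
```

The predictions and the interaction coefficient are identical in both versions. Only the lower-order coefficients change, because after centering they describe effects at the mean of the other variable rather than at zero.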
4 Linearity I
This lecture starts our discussion of linearity and diagnosing un-modeled non-linearity. Here we talk about what to do with ordinal variables on both sides of the regression equation. What do we do when variables could be considered as either categorical or continuous (e.g., Polity’s Democracy variable, seven-point indicators of party id)? We discuss ways of testing assumptions about level of measurement and understanding when it is appropriate to use variables with few categories as though they were quantitative.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
- Exercises rmd
5 Linearity II
This lecture goes deeper into diagnosing problems of un-modeled non-linearity. We talk about component plus residual plots along with local polynomial regression for diagnosing problems as well as transformations and polynomials for fixing different kinds of non-linearity. These are pretty conventional tools for diagnosing and solving uncomplicated functional form problems. In later lectures we move to more automated tools for modeling arbitrarily complicated relationships.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
- Exercises rmd
6 Relative Importance
This short lecture discusses how we assess the differential impact variables have on the dependent variable. Standardized variables and relative importance measures can both help us compare the sizes of effects.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
7 Bootstrapping
This lecture introduces the idea of bootstrapping for generating sampling distributions. This is useful for quantities with unknown sampling distributions or those where distributional assumptions are dubious. We focus mostly on bootstrapping the regression model, but work through an exercise of using the bootstrap to derive confidence intervals for local polynomial regression.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
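To make the idea of bootstrapping a regression model concrete, here is a minimal sketch of a case-resampling bootstrap for a simple regression slope with a percentile confidence interval. The simulated data, the function names and the choice of a percentile interval are all assumptions made for illustration; the lecture may use other resampling schemes and interval types, and its own code is in R rather than Python.

```python
import random


def ols_slope(pairs):
    """Slope from a simple regression of y on x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx


def bootstrap_slope_ci(pairs, n_boot=2000, level=0.95, seed=1):
    """Percentile interval from resampling (x, y) cases with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    slopes = []
    while len(slopes) < n_boot:
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        # Skip the (very unlikely) resample in which x does not vary.
        if len({x for x, _ in resample}) < 2:
            continue
        slopes.append(ols_slope(resample))
    slopes.sort()
    tail = (1 - level) / 2
    lower = slopes[int(tail * (n_boot - 1))]
    upper = slopes[int((1 - tail) * (n_boot - 1))]
    return lower, upper


# Simulated data: y = 1 + 2x + standard normal noise (values are made up).
data_rng = random.Random(42)
data = [(i / 10, 1.0 + 2.0 * (i / 10) + data_rng.gauss(0, 1)) for i in range(50)]
estimate = ols_slope(data)
lower, upper = bootstrap_slope_ci(data)
```

The sorted bootstrap slopes approximate the sampling distribution of the estimate, so the same loop works for any statistic that can be recomputed on a resample, including quantities whose sampling distribution is unknown.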
8 Model Comparison and Discrimination
This lecture discusses several methods we can use to discriminate between models, including a theoretical discussion of information criteria methods. We also discuss the Clarke test for non-nested models and talk about ways to extend the test with the clarkeTest package in R. In addition, we discuss model selection uncertainty and multi-model averaging.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
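One way to see how information criteria feed into multi-model averaging is through Akaike weights, sketched below. This is an assumption about which weighting scheme is relevant; the lecture may use a different criterion or weighting. The log-likelihoods and parameter counts are made up, and the Python code is only illustrative (the course's own code is in R).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_values):
    """Relative support for each model, normalized to sum to one."""
    best = min(aic_values)
    relative = [math.exp(-(a - best) / 2) for a in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical candidate models: (log-likelihood, number of parameters).
candidates = {"small": (-100.0, 3), "medium": (-98.0, 4), "large": (-97.5, 6)}
aics = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
weights = dict(zip(aics, akaike_weights(list(aics.values()))))
```

The weights depend only on differences in AIC from the best model, so they express how much support each candidate has relative to the others in the set, not how good any model is in absolute terms.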
9 Feature Selection and Regularization
This lecture takes a slightly different look at model selection. Rather than thinking about model testing as a theoretical enterprise, we discuss some exhaustive ways of searching the model space (given a set of candidate variables). Here, we discuss all subsets regression as well as regularization methods - ridge regression, LASSO, elastic nets and the adaptive LASSO. In particular, we focus on how these models respond to situations of high collinearity.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
10 Splines
Sometimes the parametric form of non-linearity is unknown. Splines offer us a way of modeling non-linearities that is generally more flexible than polynomials, but has smaller sampling variability than local polynomial regression. Splines also allow us to test the adequacy of parametric non-linear models in the OLS/GLM context. This lecture discusses truncated power basis functions and B-splines for considering non-linear relationships.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
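The truncated power basis mentioned in this lecture's description is simple enough to write out directly. The sketch below builds one row of the basis for a single value of x: the ordinary polynomial terms, plus one truncated term per knot that is zero to the left of the knot. The function name, the knot locations and the default cubic degree are assumptions for illustration; the course's own code is in R.

```python
def truncated_power_basis(x, knots, degree=3):
    """Basis row: 1, x, ..., x^degree, then (x - knot)_+^degree for each knot."""
    polynomial = [x ** p for p in range(degree + 1)]
    # Each truncated term switches on only once x passes its knot.
    truncated = [max(0.0, x - k) ** degree for k in knots]
    return polynomial + truncated


# One row of a cubic basis with hypothetical knots at 1 and 2.
row = truncated_power_basis(1.5, knots=[1.0, 2.0])
```

Stacking one such row per observation gives a design matrix that can be passed to an ordinary least squares routine, which is why regression splines fit inside the usual OLS/GLM framework.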
11 Penalized Splines and GAMs
We use the generalized additive models for location, scale and shape (GAMLSS) framework to talk about penalized splines, which have some benefits compared to unpenalized regression splines (the subject of lecture 10). Here, we talk about how penalized splines work, how we can estimate monotonic relationships and how to compare across models. We also revisit interactions, focusing on some recent research about linearity and non-linearity in interactions.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
12 Model Diagnostics
In this lecture, we discuss conventional model diagnostics and their relationship to diagnostics we can leverage in the GAMLSS framework. We cover outliers and robust regression, heteroskedasticity, and the higher-moment modeling that the GAMLSS framework makes easy, as well as the unified model diagnostics the framework provides.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
13 Flexible Model Fitting
Here, we discuss other models that do not require strict functional form assumptions. These include Multivariate Adaptive Regression Splines (MARS), regularized polynomial expansions, and tree-based regression models (CART). We also talk about incorporating some of these features into the GAMLSS framework.
- Slides html, PDF 1 per page, PDF 4 per page
- Code r
Software that Makes the Class Work
From time to time, people ask about the software I use in the course. Here is a brief discussion of the software I use in my teaching workflow.
- The website you see that serves all of this content is hosted by my friend Tof at asocialfolder.
- First and foremost, the course works because R and the user-developed packages that we use are all open-source.
- I use the xaringan package to make slides in RMarkdown that are based on the remark.js framework. I also use the xaringan_themer package and a bit of custom CSS to style the slides.
- I have used livecode to pretty good effect in classes. See the rstudio::conf 5 minute talk about livecode in R here. I use ngrok to serve my local private network files to the public along these lines.
- As a backup for people experiencing technical issues, I use RStudio.cloud to ensure people have access to the software we use in the course.
- I use the scribble add-in from xaringanExtra and occasionally Demo Pro for live annotation of slides and screen.