Missing Data and Multiple Imputation

I’ve got two examples here - one purely cross-sectional and one time-series cross-sectional (TSCS). The TSCS one is a study by David Rueda (2008), which was replicated by Lall in his 2016 piece. The cross-sectional one is from the 2014 American National Election Study.

Rueda Replication

For the sake of this exercise, we’re going to replicate one of Rueda’s models and assess the quality of the imputations. You can read in and manage the data as shown below. Loading the file puts an object named x in your workspace that contains the data.

f <- file("https://quantoid.net/files/reg3/rueda.rda")
load(f)
close(f)
tmp <- x[,c("country", "year", "govem", "govpart", "hkcorp", 
"govpxcor", "open", "finop", "dccggx", "unemp", 
"gdpgr", "deca70", "deca80", "deca85", "deca90", "count")]
tmp$count <- as.factor(tmp$count)

Next, make the listwise-deleted data as follows:

tmpNA <- na.omit(tmp[,c("country", "year", "govem", "govpart", "hkcorp", 
"govpxcor", "open", "finop", "dccggx", "unemp", 
"gdpgr", "deca70", "deca80", "deca85", "deca90", "count")])

  1. Estimate the model of govem (government employment) on govpart (cabinet partisanship), hkcorp (corporatism), govpxcor (cabinet partisanship x corporatism), open (international openness), finop (financial openness), dccggx (government debt), unemp (unemployment), gdpgr (GDP growth), deca70, deca80, deca85, deca90 and count (year and country variables).

  2. Impute the data three different ways: mice with the default settings, mice using random forests as the imputation method (make sure that the country and year variables are in the imputation model), and amelia. After doing that, re-estimate the model with each of the three imputed datasets. How do the three models compare?

  3. Run posterior predictive checks to assess the quality of the imputations from all three methods. Is there any difference in what you find across the methods? Pick one missing data point from each variable with missing data and plot the posterior distribution of that missing observation for each of the three methods.
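
If the mice workflow in steps 2 and 3 is new to you, here is a minimal sketch on simulated data. Everything in it is a stand-in: the data frame sim and the variables y, x1 and x2 are hypothetical, not the Rueda variables. It assumes mice version 3 or later (where summary() of a pooled fit returns an estimate column) and that a random forest backend such as ranger or randomForest is installed for method = "rf".

```r
library(mice)

set.seed(123)
n <- 200
# Simulated stand-in data; all names here are hypothetical
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n)
sim$x1[sample(n, 30)] <- NA   # 30 missing values in x1
sim$x2[sample(n, 20)] <- NA   # 20 missing values in x2

# Listwise-deletion benchmark
fit_lw <- lm(y ~ x1 + x2, data = na.omit(sim))

# mice with default methods, then with random-forest imputation
imp_def <- mice(sim, m = 5, printFlag = FALSE, seed = 123)
imp_rf  <- mice(sim, m = 5, method = "rf", printFlag = FALSE, seed = 123)

# Fit the model on each completed dataset and pool with Rubin's rules
pool_def <- pool(with(imp_def, lm(y ~ x1 + x2)))
pool_rf  <- pool(with(imp_rf,  lm(y ~ x1 + x2)))

# Coefficients side by side
cbind(listwise     = coef(fit_lw),
      mice_default = summary(pool_def)$estimate,
      mice_rf      = summary(pool_rf)$estimate)

# Imputed values for the first missing x1 cell: one draw per imputation
draws <- unlist(imp_def$imp$x1[1, ])
draws
```

With only 5 imputations, draws holds just 5 values, so a plot of the distribution for a single missing cell will be much more informative if you raise m for that part of the exercise.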

ANES 2014

Using lab3b_data.dta, do multiple imputation (with 5 imputations) on all of the variables in the dataset. You can download the data with:

library(haven)
library(DAMisc)  # provides searchVarLabels()
dat <- read_dta('http://www.quantoid.net/files/reg3/lab3b_data.dta')
searchVarLabels(dat, "")
##                   ind
## indsocial           1
## indspend            2
## dem_edugroup        3
## dem_agegrp          4
## gender_respondent   5
## libcpre_self        6
## inc_incgroup_pre    7
## relig_chmember      8
##                                                                                         label
## indsocial           Index of social liberalism-conservatism (higher values more conservative)
## indspend          Index of economic liberalism-conservatism (higher values more conservative)
## dem_edugroup                                                    Education group of respondent
## dem_agegrp                                                            Age group of respondent
## gender_respondent                                                         Respondent's Gender
## libcpre_self                                                                     libcpre_self
## inc_incgroup_pre                                           Income group (pre-election survey)
## relig_chmember                                Member of a religious congregation/denomination
dat$dem_edugroup <- as_factor(dat$dem_edugroup)
levels(dat$dem_edugroup) <- c("ltHS", "HS", "Some Coll", "BA", "Grad Deg")
dat$dem_agegrp <- as_factor(dat$dem_agegrp)
dat$dem_agegrp <- car::recode(dat$dem_agegrp, 
        'c("17-20", "21-24", "25-29") = "17-30"; c("30-34", "35-39", "40-44", "45-49") = "30-49"; 
         c("50-54", "55-59", "60-64", "65-69") = "50-69"; c("70-74", "Over 75") = "70+"', 
        as.factor = TRUE)
dat$gender_respondent <- as_factor(dat$gender_respondent)
dat <- as.data.frame(dat)

  1. See how the coefficients change in a model of libcpre_self on all the other variables in the dataset from listwise deletion to multiple imputation. Try mice, mice with random forest imputation and amelia, each with 5 imputations. Are there differences between the models?

  2. How do the results change if you use 25 imputations instead of 5?
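
For the amelia part, one way to go from imputations to pooled coefficients is sketched below on simulated data. The data frame sim, the variables y, x and grp, and the helper pooled() are all hypothetical. The sketch assumes that factor variables are declared through the noms argument and that results are combined with Amelia's own mi.meld(), which takes one row per imputation.

```r
library(Amelia)

set.seed(456)
n <- 300
# Simulated stand-in data; all names here are hypothetical
sim <- data.frame(
  x   = rnorm(n),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
sim$y <- 0.5 * sim$x + rnorm(n)
sim$x[sample(n, 40)]   <- NA   # missing values in a numeric variable
sim$grp[sample(n, 30)] <- NA   # missing values in a nominal variable

# Hypothetical helper: impute m times, fit the model on each
# completed dataset, and combine estimates with Rubin's rules
pooled <- function(m) {
  a_out <- amelia(sim, m = m, noms = "grp", p2s = 0)
  fits  <- lapply(a_out$imputations, function(d) lm(y ~ x + grp, data = d))
  b  <- t(sapply(fits, coef))                              # m x k estimates
  se <- t(sapply(fits, function(f) sqrt(diag(vcov(f)))))   # m x k std. errors
  mi.meld(q = b, se = se)
}

res5  <- pooled(5)
res25 <- pooled(25)

# First row: pooled coefficients with m = 5; second row: m = 25
rbind(res5$q.mi, res25$q.mi)
# Same layout for the pooled standard errors
rbind(res5$se.mi, res25$se.mi)
```

Comparing the two rows gives a feel for how much of the difference between runs is just simulation noise from a small number of imputations.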

Finite Mixture Models

Using the data above, estimate a finite mixture model where you assume that indsocial and indspend operationalize the two different theories (don’t use relig_chmember and inc_incgroup_pre in the models, yet.) Evaluate the resulting model against the linear model where there is an interaction between indsocial and indspend including the other controls.

  1. Run an OLS regression of libcpre_self_num on indsocial (a composite of social policy attitudes) and indspend (a composite of spending policy attitudes), their interaction and the controls gender_respondent, dem_edugroup and dem_agegrp_num.

  2. Estimate a finite mixture model where indsocial is the variable of interest in one component and indspend is the variable of interest in the other. That is, set the indspend coefficient to zero in the component with indsocial and indsocial’s coefficient to zero in the component with indspend. Fix the coefficients on the other regressors to be constant across the two components.

  3. How do the two different models fit? Which do you think is better?

  4. Consider that income (an economic predictor operationalized by inc_incgroup_pre) and church membership (a social predictor operationalized by relig_chmember) might tell us something about the probabilities of being in one or the other group. Incorporate that information and see whether, in fact, they do provide information about group membership.
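
One possible way to set this up, sketched on simulated data, uses the flexmix package. This is an assumption about tooling, not the only route: the handout does not name a package, and the variables y, x1, x2, z and w below are hypothetical. The sketch relies on FLXMRglmfix(), where fixed holds regressors whose coefficients are shared across components and nested gives each component its own extra term, and on FLXPmultinom() for a concomitant model of the component probabilities.

```r
library(flexmix)

set.seed(789)
n <- 400
# Simulated stand-in data; all names here are hypothetical
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), z = rnorm(n), w = rnorm(n))
grp <- rbinom(n, 1, plogis(1.5 * sim$w))   # latent group; w shifts membership
sim$y <- ifelse(grp == 1, 3 * sim$x1, -3 * sim$x2) +
  0.5 * sim$z + rnorm(n, sd = 0.5)

# Benchmark: single linear model with an interaction
ols <- lm(y ~ x1 * x2 + z, data = sim)

# Component 1 gets x1 only, component 2 gets x2 only;
# the coefficient on z is held constant across components
mix_model <- FLXMRglmfix(
  family = "gaussian",
  fixed  = ~ z,
  nested = list(k = c(1, 1), formula = c(~ x1, ~ x2))
)
mix <- flexmix(y ~ 1, data = sim, k = 2, model = mix_model)

# Same mixture, with w predicting component membership
mix_c <- flexmix(y ~ 1, data = sim, k = 2, model = mix_model,
                 concomitant = FLXPmultinom(~ w))

# Compare fit across the three models
c(ols = BIC(ols), mixture = BIC(mix), mixture_concomitant = BIC(mix_c))
parameters(mix)
```

Because EM starts from a random assignment, results can vary from run to run; it is worth refitting from several starts before drawing conclusions from the comparison.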