I’ve got two examples here: one time-series cross-sectional (TSCS) and one purely cross-sectional. The TSCS one is a study by David Rueda (2008), which was replicated by Lall in his 2016 piece. The cross-sectional one is from the 2014 American National Election Study.
For the sake of this exercise, we’re going to replicate one of Rueda’s models and assess the quality of the imputations. You can read in and manage the data as shown below. This puts an object named x in your workspace that contains the data.
f <- file("https://quantoid.net/files/reg3/rueda.rda")
load(f)
close(f)
tmp <- x[,c("country", "year", "govem", "govpart", "hkcorp",
"govpxcor", "open", "finop", "dccggx", "unemp",
"gdpgr", "deca70", "deca80", "deca85", "deca90", "count")]
tmp$count <- as.factor(tmp$count)
Next, we can make the listwise-deleted data as follows:
tmpNA <- na.omit(tmp[,c("country", "year", "govem", "govpart", "hkcorp",
"govpxcor", "open", "finop", "dccggx", "unemp",
"gdpgr", "deca70", "deca80", "deca85", "deca90", "count")])
Estimate the model of govem (government employment) on govpart (cabinet partisanship), hkcorp (corporatism), govpxcor (cabinet partisanship x corporatism), open (international openness), finop (financial openness), dccggx (government debt), unemp (unemployment), gdpgr (GDP growth), deca70, deca80, deca85, deca90 and count (the year and country variables).
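As a starting point, the model just described could be written as an ordinary linear regression. This is only a sketch: using plain lm() is an assumption (Rueda's published specification may use a different estimator or standard errors), and fit_govem is a hypothetical helper name, not something from the original materials.

```r
# Sketch only: a plain linear model is an assumption here; Rueda's
# published specification may use a different estimator or standard errors.
# `d` is any data frame with the columns selected into `tmp`.
fit_govem <- function(d) {
  lm(govem ~ govpart + hkcorp + govpxcor + open + finop + dccggx +
       unemp + gdpgr + deca70 + deca80 + deca85 + deca90 + count,
     data = d)
}
```

With the objects created above, summary(fit_govem(tmpNA)) would give the listwise-deletion estimates.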
Impute the data three different ways: mice with the default settings, mice with random forests as the imputation method (make sure that the country and year variables are in the imputation model), and amelia. After doing that, re-estimate the models with the three different imputed datasets. How do the three models compare?
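One way to organize the three imputation runs is sketched below. The helper names (pool_lm, impute_three_ways), the seed and the argument defaults are all assumptions made for illustration. The sketch fits the model on each completed dataset and pools with Rubin's rules via mice::pool(), which lets the Amelia output go through the same pooling code. Note that method = "rf" in mice needs a random-forest package (such as ranger) to be installed.

```r
library(mice)
library(Amelia)

# Hypothetical helper: fit `form` on each completed dataset and pool
# the estimates with Rubin's rules.
pool_lm <- function(datasets, form) {
  fits <- lapply(datasets, function(dd) lm(form, data = dd))
  summary(pool(as.mira(fits)))
}

# Hypothetical helper: impute `d` three ways and return pooled results.
# ts, cs, idvars are passed to amelia() to flag time, unit and ID columns.
impute_three_ways <- function(d, form, m = 5, seed = 1234,
                              ts = NULL, cs = NULL, idvars = NULL) {
  # 1. mice with its default method for each column type
  imp_def <- mice(d, m = m, seed = seed, printFlag = FALSE)
  # 2. mice with random-forest imputation for every incomplete column
  imp_rf <- mice(d, m = m, method = "rf", seed = seed, printFlag = FALSE)
  # 3. Amelia (EM with bootstrapping); p2s = 0 silences console output
  set.seed(seed)
  imp_am <- amelia(d, m = m, ts = ts, cs = cs, idvars = idvars, p2s = 0)

  list(
    mice_default = pool_lm(mice::complete(imp_def, "all"), form),
    mice_rf      = pool_lm(mice::complete(imp_rf, "all"), form),
    amelia       = pool_lm(imp_am$imputations, form)
  )
}
```

For the Rueda data, a natural (but untested) starting point for the Amelia arguments would be ts = "year", cs = "country" and idvars = "count"; factor columns that are not declared this way would otherwise need to go in noms or ords.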
Run posterior predictive checks to assess the quality of the imputations from all three methods. Is there any difference in what you find across the methods? Pick one missing data point from each variable with missing data and plot the posterior distribution of the missing observation for the three different methods.
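For the single-observation plots, the draws for one missing cell can be pulled out of the imputation objects directly. The two helpers below (draws_mice and draws_amelia, both hypothetical names) are a minimal sketch: the first reads the `imp` element of a mice object, the second looks up the same cell in each of Amelia's completed datasets.

```r
library(mice)
library(Amelia)

# Imputed values for the `which`-th missing case of `var` in a mids object.
# imp$imp[[var]] has one row per missing case and one column per imputation.
draws_mice <- function(imp, var, which = 1) {
  unlist(imp$imp[[var]][which, ])
}

# Same cell from an amelia object; `d` is the original (incomplete) data,
# used only to locate the row that was missing.
draws_amelia <- function(a, d, var, which = 1) {
  row <- which(is.na(d[[var]]))[which]
  vapply(a$imputations, function(dd) as.numeric(dd[[var]][row]), numeric(1))
}
```

Something like plot(density(draws_mice(imp, "unemp"))) would then show the distribution for the first missing unemp value, assuming unemp has missing data. With only 5 imputations there are only 5 draws per cell, so a larger m is probably needed for a readable density. For checks on whole variables, mice's densityplot() and Amelia's compare.density() and overimpute() are built-in options.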
Using lab3b_data.dta, do multiple imputation (with 5 imputations) on all of the variables in the dataset. You can download the data with:
library(haven)
library(DAMisc)  # provides searchVarLabels()
dat <- read_dta('http://www.quantoid.net/files/reg3/lab3b_data.dta')
searchVarLabels(dat, "")
## ind
## indsocial 1
## indspend 2
## dem_edugroup 3
## dem_agegrp 4
## gender_respondent 5
## libcpre_self 6
## inc_incgroup_pre 7
## relig_chmember 8
## label
## indsocial Index of social liberalism-conservatism (higher values more conservative)
## indspend Index of economic liberalism-conservatism (higher values more conservative)
## dem_edugroup Education group of respondent
## dem_agegrp Age group of respondent
## gender_respondent Respondent's Gender
## libcpre_self libcpre_self
## inc_incgroup_pre Income group (pre-election survey)
## relig_chmember Member of a religious congregation/denomination
dat$dem_edugroup <- as_factor(dat$dem_edugroup)
levels(dat$dem_edugroup) <- c("ltHS", "HS", "Some Coll", "BA", "Grad Deg")
dat$dem_agegrp <- as_factor(dat$dem_agegrp)
dat$dem_agegrp <- car::recode(dat$dem_agegrp,
'c("17-20", "21-24", "25-29") = "17-30"; c("30-34", "35-39", "40-44", "45-49") = "30-49";
c("50-54", "55-59", "60-64", "65-69") = "50-69"; c("70-74", "Over 75") = "70+"',
as.factor = TRUE)
dat$gender_respondent <- as_factor(dat$gender_respondent)
dat <- as.data.frame(dat)
See how the coefficients change in a model of libcpre_self on all the other variables in the dataset from listwise deletion to multiple imputation. Try mice, mice with random forest imputation and amelia, each with 5 imputations. Are there differences between the models?
How do the results change if you use 25 imputations instead of 5?
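To see what the number of imputations does, it can help to put the pooled estimates and standard errors for different values of m side by side. The sketch below does this for default mice only. compare_m is a hypothetical helper, and the `term`, `estimate` and `std.error` columns assume a reasonably recent version of mice.

```r
library(mice)

# Hypothetical helper: pooled estimates of `form` for several values of m,
# stacked into one long data frame.
compare_m <- function(d, form, ms = c(5, 25), seed = 1234) {
  out <- lapply(ms, function(m) {
    imp  <- mice(d, m = m, seed = seed, printFlag = FALSE)
    fits <- lapply(mice::complete(imp, "all"),
                   function(dd) lm(form, data = dd))
    s <- summary(pool(as.mira(fits)))
    data.frame(term = as.character(s$term), m = m,
               estimate = s$estimate, std.error = s$std.error)
  })
  do.call(rbind, out)
}
```

The same pattern would work for the random forest and amelia runs by swapping the imputation call.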
Using the data above, estimate a finite mixture model where you assume that indsocial and indspend operationalize the two different theories (don’t use relig_chmember and inc_incgroup_pre in the models yet). Evaluate the resulting model against the linear model where there is an interaction between indsocial and indspend, including the other controls.
Run an OLS regression of libcpre_self_num on indsocial (a composite of social policy attitudes) and indspend (a composite of spending policy attitudes), their interaction and the controls gender_respondent, dem_edugroup and dem_agegrp_num.
Estimate a finite mixture model where indsocial is the variable of interest in one component and indspend is the variable of interest in the other. That is, set the indspend coefficient to zero in the model with indsocial and indsocial’s coefficient to zero in the model with indspend. Fix the coefficients on the other regressors to be constant across the two components.
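One way to express these constraints is with flexmix's FLXMRglmfix driver, where `fixed` holds the regressors whose coefficients are shared across components and `nested` gives each component its own extra term. The sketch below is a guess at a workable setup, not the official solution. fit_mixture is a hypothetical helper, it assumes the numeric variables libcpre_self_num and dem_agegrp_num already exist in the data, and it lets the intercept differ across components.

```r
library(flexmix)

# Sketch: two-component mixture where component 1 uses indsocial only and
# component 2 uses indspend only; the controls share coefficients.
fit_mixture <- function(d, seed = 1234) {
  vars <- c("libcpre_self_num", "indsocial", "indspend",
            "gender_respondent", "dem_edugroup", "dem_agegrp_num")
  d <- na.omit(d[, vars])  # work with complete cases only
  set.seed(seed)           # EM starts from a random assignment
  flexmix(libcpre_self_num ~ 1, data = d, k = 2,
          model = FLXMRglmfix(
            fixed  = ~ gender_respondent + dem_edugroup + dem_agegrp_num,
            nested = list(k = c(1, 1),
                          formula = c(~ indsocial, ~ indspend))))
}
```

Calling summary() and parameters() on the result shows the component sizes and coefficients. Because EM can stop at a local optimum, it is worth trying a few different seeds.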
How do the two different models fit? Which do you think is better?
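Information criteria are one common way to set the two fits side by side. The helper below (fit_table, a hypothetical name) collects the log-likelihood, AIC and BIC for a named list of fitted models. It assumes flexmix is attached so that these functions work on mixture fits as well as on lm objects.

```r
library(flexmix)  # attached so logLik/AIC/BIC work on flexmix fits

# Hypothetical helper: fit statistics for a named list of fitted models.
fit_table <- function(mods) {
  data.frame(
    model  = names(mods),
    logLik = sapply(mods, function(m) as.numeric(logLik(m))),
    AIC    = sapply(mods, AIC),
    BIC    = sapply(mods, BIC),
    row.names = NULL
  )
}
```

The comparison is only meaningful if both models were fit to the same observations, so the OLS model should be estimated on the same complete cases as the mixture.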
Consider that income (an economic predictor operationalized by inc_incgroup_pre) and church membership (a social predictor operationalized by relig_chmember) might tell us something about the probabilities of being in one or the other group. Incorporate that information and see whether, in fact, they do provide information about group membership.
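In flexmix, predictors of group membership go in as concomitant variables. The sketch below uses a constrained two-component setup (indsocial in one component, indspend in the other, shared control coefficients) and adds FLXPmultinom() with income and church membership. As before, fit_mixture_concomitant is a hypothetical helper, and libcpre_self_num and dem_agegrp_num are assumed to exist in the data.

```r
library(flexmix)

# Sketch: constrained two-component mixture with a multinomial logit
# model for the component membership probabilities.
fit_mixture_concomitant <- function(d, seed = 1234) {
  vars <- c("libcpre_self_num", "indsocial", "indspend",
            "gender_respondent", "dem_edugroup", "dem_agegrp_num",
            "inc_incgroup_pre", "relig_chmember")
  d <- na.omit(d[, vars])  # complete cases, including the concomitants
  set.seed(seed)           # EM starts from a random assignment
  flexmix(libcpre_self_num ~ 1, data = d, k = 2,
          model = FLXMRglmfix(
            fixed  = ~ gender_respondent + dem_edugroup + dem_agegrp_num,
            nested = list(k = c(1, 1),
                          formula = c(~ indsocial, ~ indspend))),
          concomitant = FLXPmultinom(~ inc_incgroup_pre + relig_chmember))
}
```

parameters(fit, which = "concomitant") returns the membership coefficients. Comparing BIC with and without the concomitant model is one way to judge whether the two variables carry information about group membership.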