Both of the questions in this lab use state repression as the dependent variable. In a general sense, state repression is the violation of human rights by the state. In this case, the focus is on the set of “physical integrity rights” - the rights to be free from torture, political imprisonment, extrajudicial killing and forced disappearance.
This question uses the q1data.rda file (which will put an object in your workspace called q1data). You can get this file either by downloading it from the course website https://quantoid.net/teachicpsr/regression3 or by doing the following in R:
q1 <- file("https://quantoid.net/files/reg3/q1data.rda")
load(q1)
close(q1)
This file has a number of variables. You can find a short description of each with:
searchVarLabels(q1data, "")
## ind label
## ccode 1 COW code
## Year 2 Year
## polity2 3 Polity2
## pop 4 Population
## all_phones_pc 5 All Telephones, including Cellular, Per Capita
## gdppc_mp 6 Gross National Product Per Capita (Market Prices)
## revols 7 Revolutions
## riots 8 Riots
## agdems 9 Anti-Government Demonstrations
## terror_incidents 10 Number of terrorist incidents
## physint 11 CIRI Physical Integrity Index
The dependent variable is going to be physint, the CIRI physical integrity rights index. All of the other variables, except for ccode and Year, will be the independent variables.
We can use the alsosDV function in the DAMisc package to figure out whether the 9-category (0-8) variable can be treated as an interval-level variable and modeled with OLS without other modifications.
library(DAMisc)
mod <- alsosDV(physint ~ polity2 + pop + all_phones_pc + gdppc_mp + revols + riots + agdems + terror_incidents, data=q1data)
Then, we could make a plot of the measurement function:
plot(mod$result)
This looks as though we could use this as an interval-level variable. For those of you in the scaling class, this would not be surprising because Cingranelli and Richards motivate the creation of the physical integrity rights index from the point of view of a cumulative scale, and that model fits the data relatively well.
library(car)
mod <- lm(physint ~ polity2 + pop + all_phones_pc + gdppc_mp +
revols + riots + agdems + terror_incidents, data=q1data)
crPlots(model=mod, layout=c(3,3))
I’ll make the following changes to the model and then re-evaluate:
mod2 <- lm(physint ~ poly(polity2, 2) + log(pop) +
log(all_phones_pc) + log(gdppc_mp) +
revols + riots + agdems + terror_incidents, data=q1data)
summary(mod2)
##
## Call:
## lm(formula = physint ~ poly(polity2, 2) + log(pop) + log(all_phones_pc) +
## log(gdppc_mp) + revols + riots + agdems + terror_incidents,
## data = q1data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0758 -0.9380 0.0953 0.9651 5.7792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.172874 0.254910 28.139 <2e-16 ***
## poly(polity2, 2)1 50.041475 1.871044 26.745 <2e-16 ***
## poly(polity2, 2)2 20.841790 1.769869 11.776 <2e-16 ***
## log(pop) -0.362858 0.020563 -17.646 <2e-16 ***
## log(all_phones_pc) -0.255177 0.027557 -9.260 <2e-16 ***
## log(gdppc_mp) 0.548943 0.036473 15.051 <2e-16 ***
## revols -0.723978 0.059539 -12.160 <2e-16 ***
## riots -0.005719 0.023139 -0.247 0.805
## agdems -0.025929 0.020954 -1.237 0.216
## terror_incidents -0.171783 0.008383 -20.491 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.494 on 3177 degrees of freedom
## Multiple R-squared: 0.5579, Adjusted R-squared: 0.5566
## F-statistic: 445.4 on 9 and 3177 DF, p-value: < 2.2e-16
crPlots(mod2, layout=c(3,3))
I think the only other modification worth making might be to dichotomize revolutions (or at least recode the two high values as threes). Looking at the C+R plot, it appears that the real decrease is from zero to one, and the other changes result in roughly no change in the DV. First, if we want to look at transformations, we could use the boxTidwell function. If you do this with all three of the variables that look like they could be amenable to a non-linear transformation (pop, all_phones_pc, gdppc_mp), you’ll notice that the model fails:
boxTidwell(physint ~ gdppc_mp + pop + all_phones_pc,
~ poly(polity2, 2) +
I(revols > 0) + riots + agdems + terror_incidents,
data=q1data)
## Error in lm.fit(cbind(1, x1.p, x2), y, ...): NA/NaN/Inf in 'x'
With one additional argument, you can see where things go wrong.
boxTidwell(physint ~ all_phones_pc + gdppc_mp + pop ,
~ poly(polity2, 2) +
I(revols > 0) + riots + agdems + terror_incidents,
data=q1data, verbose=T)
## iter = 1 powers = 2.527061 1.695795 0.2725716
## iter = 2 powers = 1.733256 -1.113484 0.05912212
## iter = 3 powers = -1.921426 0.5651 0.05605113
## iter = 4 powers = 13.97179 0.3848521 0.004292227
## iter = 5 powers = -37.23096 0.7811189 0.0254453
## iter = 6 powers = -37.10014 0.2974009 0.02349408
## iter = 7 powers = -13.05648 0.67927 0.0211009
## iter = 8 powers = 2827.259 0.4281952 0.02491519
## Error in lm.fit(cbind(1, x1.p, x2), y, ...): NA/NaN/Inf in 'x'
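A quick numeric illustration (the values here are made up, not from the data) of why runaway trial powers break lm.fit:

```r
# Box-Tidwell raises each regressor to a trial power; per-capita phone
# counts are small fractions, so extreme trial powers underflow to 0
# or overflow to Inf in double precision
x <- 0.02     # an illustrative per-capita value
x^2827        # underflows to exactly 0
0.5^-3000     # overflows to Inf
# once a transformed column is all 0 or Inf, lm.fit() sees NA/NaN/Inf
```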
The algorithm wants to send the power on all_phones_pc to really high values, which results in infinite values. Taking out all_phones_pc highlights a different problem.
boxTidwell(physint ~ gdppc_mp + pop,
  ~ poly(polity2, 2) + all_phones_pc +
  I(revols > 0) + riots + agdems + terror_incidents,
  data=q1data, verbose=T)
## iter = 1 powers = 0.7997319 -139.8122
## Error in lm.fit(cbind(1, x.log.x, x1.p, x2), y, ...): NA/NaN/Inf in 'x'
The gdppc_mp variable also gets an unusually large negative power, which essentially makes all values of the transformed variable zero. A little bit of trial and error shows that all_phones_pc could actually go back in the model, so long as gdppc_mp is not there.
boxTidwell(physint ~ all_phones_pc + pop ,
~ gdppc_mp + poly(polity2, 2) +
I(revols > 0) + riots + agdems + terror_incidents,
data=q1data)
## MLE of lambda Score Statistic (z) Pr(>|z|)
## all_phones_pc 3.488003 -4.5974 4.278e-06 ***
## pop 0.046282 10.2349 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## iterations = 12
We could include polynomials in polity2 and gdppc_mp to capture their respective assumed functional forms.
mod3 <- lm(physint ~ poly(polity2, 2) + log(pop) +
I(all_phones_pc^3.5) + poly(gdppc_mp, 3) +
I(revols>0) + riots + agdems + terror_incidents,
data=q1data)
summary(mod3)
##
## Call:
## lm(formula = physint ~ poly(polity2, 2) + log(pop) + I(all_phones_pc^3.5) +
## poly(gdppc_mp, 3) + I(revols > 0) + riots + agdems + terror_incidents,
## data = q1data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6069 -0.9328 0.0401 0.9437 4.6749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.215e+00 1.817e-01 50.707 < 2e-16 ***
## poly(polity2, 2)1 4.234e+01 1.711e+00 24.746 < 2e-16 ***
## poly(polity2, 2)2 1.944e+01 1.723e+00 11.278 < 2e-16 ***
## log(pop) -3.965e-01 2.034e-02 -19.490 < 2e-16 ***
## I(all_phones_pc^3.5) -7.064e-20 1.082e-20 -6.528 7.75e-11 ***
## poly(gdppc_mp, 3)1 3.719e+01 2.284e+00 16.280 < 2e-16 ***
## poly(gdppc_mp, 3)2 -1.343e+01 1.589e+00 -8.452 < 2e-16 ***
## poly(gdppc_mp, 3)3 2.082e+00 1.529e+00 1.362 0.173
## I(revols > 0)TRUE -1.052e+00 8.076e-02 -13.027 < 2e-16 ***
## riots 6.914e-03 2.262e-02 0.306 0.760
## agdems -2.598e-02 2.052e-02 -1.266 0.206
## terror_incidents -1.607e-01 8.329e-03 -19.290 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.467 on 3175 degrees of freedom
## Multiple R-squared: 0.5737, Adjusted R-squared: 0.5722
## F-statistic: 388.4 on 11 and 3175 DF, p-value: < 2.2e-16
The Clarke test would show us that the model with the transformations and polynomials is better:
library(games)
clarke(mod2, mod3)
##
## Clarke test for non-nested models
##
## Model 1 log-likelihood: -5796
## Model 2 log-likelihood: -5738
## Observations: 3187
## Test statistic: 1303 (41%)
##
## Model 2 is preferred (p < 2e-16)
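The Clarke test is essentially a paired sign test on observation-level log-likelihoods. A minimal sketch of the idea on synthetic data (ignoring the Schwarz-type degrees-of-freedom correction the full test applies; x, y, m1 and m2 below are illustrative, not the lab models):

```r
set.seed(2)
x <- runif(300)
y <- x^2 + rnorm(300, sd = 0.1)
m1 <- lm(y ~ x)           # misspecified linear fit
m2 <- lm(y ~ poly(x, 2))  # correctly specified quadratic fit
# per-observation normal log-likelihoods from each model's residuals
ll <- function(m) dnorm(residuals(m), sd = summary(m)$sigma, log = TRUE)
# count observations where model 2 beats model 1
b <- sum(ll(m2) > ll(m1))
# the test statistic compares b to n/2, e.g. via binom.test(b, 300)
```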
For the polynomial terms, we might want to know whether we got the right degrees of freedom.
NKnots(physint ~ polity2 + log(pop) +
I(all_phones_pc^3.5) + poly(gdppc_mp, 3) +
I(revols>0) + riots + agdems + terror_incidents, max.knots=6,
data=q1data, var="polity2", plot=T, criterion="BIC", includePoly=T)
library(splines)
NKnots(physint ~ gdppc_mp + log(pop) +
I(all_phones_pc^3.5) + bs(polity2, df=4) +
I(revols>0) + riots + agdems + terror_incidents, max.knots=6,
data=q1data, var="gdppc_mp", plot=T, criterion="BIC", includePoly=T)
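The BIC comparison that NKnots automates can be sketched by hand; the snippet below does it on synthetic data (x and y are made up, not the lab data), fitting B-spline bases of increasing df and looking for the smallest BIC:

```r
library(splines)
set.seed(1)
x <- runif(500, -10, 10)
y <- x + 0.05 * x^2 + rnorm(500)
# candidate spline bases; df = 3 is a cubic with no interior knots
bics <- sapply(3:6, function(d) BIC(lm(y ~ bs(x, df = d))))
names(bics) <- paste0("df=", 3:6)
bics  # prefer the smallest df whose BIC is at (or near) the minimum
```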
Note that we use the BIC to try to identify parsimonious models. The 2-df solution is almost the same as the 6-df solution, so we use the more parsimonious specification. This would suggest the following model:
mod4 <- lm(physint ~ bs(polity2, df=4) + log(pop) +
I(all_phones_pc^3.5) + poly(gdppc_mp, 2) +
I(revols>0) + riots + agdems + terror_incidents,
data=q1data)
summary(mod4)
##
## Call:
## lm(formula = physint ~ bs(polity2, df = 4) + log(pop) + I(all_phones_pc^3.5) +
## poly(gdppc_mp, 2) + I(revols > 0) + riots + agdems + terror_incidents,
## data = q1data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4783 -0.9195 -0.0027 0.9352 4.9461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.474e+00 2.010e-01 42.150 < 2e-16 ***
## bs(polity2, df = 4)1 -3.087e-01 3.161e-01 -0.977 0.329
## bs(polity2, df = 4)2 7.650e-02 2.334e-01 0.328 0.743
## bs(polity2, df = 4)3 8.887e-01 1.827e-01 4.863 1.21e-06 ***
## bs(polity2, df = 4)4 2.338e+00 1.366e-01 17.110 < 2e-16 ***
## log(pop) -4.036e-01 2.026e-02 -19.918 < 2e-16 ***
## I(all_phones_pc^3.5) -6.714e-20 1.070e-20 -6.274 3.99e-10 ***
## poly(gdppc_mp, 2)1 2.959e+01 2.439e+00 12.133 < 2e-16 ***
## poly(gdppc_mp, 2)2 -1.078e+01 1.592e+00 -6.772 1.51e-11 ***
## I(revols > 0)TRUE -1.057e+00 7.973e-02 -13.252 < 2e-16 ***
## riots 7.549e-03 2.237e-02 0.337 0.736
## agdems -1.469e-02 2.031e-02 -0.723 0.469
## terror_incidents -1.536e-01 8.259e-03 -18.592 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.451 on 3174 degrees of freedom
## Multiple R-squared: 0.5831, Adjusted R-squared: 0.5816
## F-statistic: 370 on 12 and 3174 DF, p-value: < 2.2e-16
We could plot the variable effects out with:
library(effects)
plot(allEffects(mod4, xlevels=25))
Since the ALSOS result depends on the right-hand-side specification of the model, we could run it again just to make sure:
mod4a <- alsosDV(physint ~ bs(polity2, df=4) + log(pop) +
I(all_phones_pc^3.5) + poly(gdppc_mp, 2) +
I(revols>0) + riots + agdems + terror_incidents,
data=q1data)
plot(mod4a$result)
Now, if we wanted a relatively simple test, we could use a GAM with smoothers on all of the transformed variables and test against the alternative.
library(mgcv)
mod4b <- gam(physint ~ s(polity2, bs="cs") +
s(log(pop), bs="cs") + s(all_phones_pc, bs="cs") +
s(gdppc_mp, bs="cs") + I(revols > 0) + riots +
agdems + terror_incidents, data=q1data)
anova(mod4, mod4b, test="F")
## Analysis of Variance Table
##
## Model 1: physint ~ bs(polity2, df = 4) + log(pop) + I(all_phones_pc^3.5) +
## poly(gdppc_mp, 2) + I(revols > 0) + riots + agdems + terror_incidents
## Model 2: physint ~ s(polity2, bs = "cs") + s(log(pop), bs = "cs") + s(all_phones_pc,
## bs = "cs") + s(gdppc_mp, bs = "cs") + I(revols > 0) + riots +
## agdems + terror_incidents
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3174.0 6683.3
## 2 3155.6 6486.7 18.36 196.6 5.2093 3.083e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod4b, pages=1)
This suggests that the parametric model we estimated was not quite flexible enough to fit all of the nuances of the data.
This isn’t required to answer the question, but some of you have asked about time-series cross-sectional data. If you don’t want to estimate the model with a bunch of extra dummy variables in it, you could do the within transformation of all of the variables and then estimate models on the within-transformed data to get the fixed-effects estimator.
cc <- q1data[complete.cases(q1data[,c("physint", "polity2", "pop", "all_phones_pc", "gdppc_mp",
"revols", "riots", "agdems", "terror_incidents")]), ]
X <- model.matrix(~ physint + polity2 + log(pop) + all_phones_pc + gdppc_mp +
I(revols > 0) + agdems + terror_incidents + riots, data=cc)
Xw <- lm(X ~ as.factor(ccode), data=cc)$residuals
Xw <- as.data.frame(Xw[,-1])
Xw$year <- cc$Year
names(Xw)[6] <- "revols"
names(Xw)[3] <- "pop"
mod4c <- gam(physint ~ s(polity2, bs="ts") +
s(pop, bs="ts") + s(all_phones_pc, bs="ts") +
s(gdppc_mp, bs="ts") + revols + riots +
agdems + terror_incidents + s(year, bs="ts"), data=Xw)
plot(mod4c, pages=1)
We’re going to continue to investigate repression in this question with a different dependent variable and potentially many other independent variables. This question uses the q2data.rda file (which will put an object in your workspace called q2data). You can get this file either by downloading it from the course website https://quantoid.net/teachicpsr/regression3 or by doing the following in R:
q2 <- file("https://quantoid.net/files/reg3/q2data.rda")
load(q2)
close(q2)
There are too many variables to print out the list, but you can generate it with:
searchVarLabels(q2data, "")
I have organized the data so the country identifiers are first, the DV fariss_repress is next, then variables related to democracy and rights, then variables related to conflict, and finally variables related to other characteristics of the country. Before you start your investigation, I want you to take 33% of your data out and save it for later. You can do that as follows:
samps <- sample(1:nrow(q2data), floor(.33*nrow(q2data)), replace=FALSE)
keep <- setdiff(1:nrow(q2data), samps)
dat66 <- q2data %>% filter(1:nrow(q2data) %in% keep)
hold <- q2data %>% filter(1:nrow(q2data) %in% samps)
train33 <- sample(1:nrow(dat66), floor(.5*nrow(dat66)), replace=FALSE)
test33 <- setdiff(1:nrow(dat66), train33)
train <- dat66 %>% filter(1:nrow(dat66) %in% train33)
test <- dat66 %>% filter(1:nrow(dat66) %in% test33)
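One caveat on the split above: sample() draws at random, so the partition changes on every run. Setting a seed first makes it reproducible; a minimal sketch with a stand-in row count (the seed value is arbitrary):

```r
set.seed(519)  # arbitrary seed; any fixed value gives a repeatable split
n <- 1000      # stand-in for nrow(q2data)
samps <- sample(1:n, floor(.33 * n), replace = FALSE)
keep  <- setdiff(1:n, samps)
length(samps)                   # 330 rows held out
length(intersect(samps, keep))  # 0 -- hold-out and keep sets are disjoint
```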
We’ll do the preliminary investigation on half of the dat66 object and the preliminary testing on the other half. Only once we get to a final model we want to try will we do so on the hold data object.
First, I’ll use the checks and parcomp variables for democracy. I’ll also use terror_binary, guerrilla and riots as the conflict variables. Finally, I’ll include gdppc_mp, pop, and elec_prod_kwh_pc as the other variables in the model.
mod <- lm(fariss_repress ~ checks + parcomp + terror_binary + guerrilla + riots +
gdppc_mp + pop + elec_prod_kwh_pc + Year, data=train)
summary(mod)
##
## Call:
## lm(formula = fariss_repress ~ checks + parcomp + terror_binary +
## guerrilla + riots + gdppc_mp + pop + elec_prod_kwh_pc + Year,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.09642 -0.47263 -0.00339 0.50428 2.51319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.053e+01 6.395e+00 -1.647 0.099858 .
## checks -3.739e-02 1.785e-02 -2.095 0.036445 *
## parcomp -3.196e-01 2.262e-02 -14.130 < 2e-16 ***
## terror_binary 6.150e-01 4.954e-02 12.416 < 2e-16 ***
## guerrilla 6.754e-01 6.249e-02 10.808 < 2e-16 ***
## riots 6.959e-02 1.964e-02 3.543 0.000412 ***
## gdppc_mp -2.973e-05 3.858e-06 -7.706 2.97e-14 ***
## pop 9.003e-07 1.725e-07 5.219 2.16e-07 ***
## elec_prod_kwh_pc -2.962e-06 3.715e-07 -7.972 4.00e-15 ***
## Year 5.656e-03 3.212e-03 1.761 0.078547 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7651 on 1063 degrees of freedom
## Multiple R-squared: 0.6395, Adjusted R-squared: 0.6365
## F-statistic: 209.5 on 9 and 1063 DF, p-value: < 2.2e-16
library(rpart)
cart1 <- rpart(formula(mod), data=train)
plot(cart1)
text(cart1)
cart.test <- predict(cart1, newdata=test)
cart.train <- predict(cart1, newdata=train)
cor(cart.test, test$fariss_repress)^2
## [1] 0.6759275
library(randomForest)
rf1 <- randomForest(formula(mod), data=train)
rf.test <- predict(rf1, newdata=test)
rf.train <- predict(rf1, newdata=train)
cor(rf.test, test$fariss_repress)^2
## [1] 0.8470609
library(xgboost)
Xm <- model.matrix(formula(mod), data=train)
Xmtest <- model.matrix(formula(mod), data=test)
y <- model.response(model.frame(formula(mod), data=train))
ytest <- model.response(model.frame(formula(mod), data=test))
dtrain <- xgb.DMatrix(data=Xm, info=list(label=y))
xgb1c <- xgb.cv(data=dtrain, nfold=5, ncross=25, nrounds=50)
xgb1 <- xgboost(data=dtrain, nrounds=50, params=list(objective="reg:linear"))
xgb.test <- predict(xgb1, newdata=Xmtest, nrounds=50)
xgb.train <- predict(xgb1, newdata=Xm, nrounds=50)
For the bartMachine analysis, I used bartMachineCV (which takes a reasonably long time), but it identified the following set of parameters as those that minimize the out-of-sample error: k=2, nu=10, q=.75, num_trees=200, so I implement that advice below.
library(bartMachine)
Xmdf <- as.data.frame(model.matrix(formula(mod), data=train))
Xmtestdf <- as.data.frame(model.matrix(formula(mod), data=test))
bm1 <- bartMachine(Xmdf, y, k=2, nu=10, q=.75, num_trees=200)
## bartMachine initializing with 200 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 11 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## evaluating in sample data...done
bm1.test <- predict(bm1, new_data=Xmtestdf)
bm1.train <- bm1$y_hat_train
Now that we’ve done all of those models, we can look at the in- and out-of-sample performance.
iscor <- cor(cbind(bm1.train, xgb.train, rf.train, cart.train), y)^2
oscor <- cor(cbind(bm1.test, xgb.test, rf.test, cart.test), ytest)^2
cors <- cbind( iscor, oscor)
colnames(cors) <- c("In", "Out")
rownames(cors) <- c("BART", "XGB", "RF", "CART")
cors
## In Out
## BART 0.8863910 0.8033032
## XGB 0.9907074 0.8570469
## RF 0.9691669 0.8470609
## CART 0.7166293 0.6759275
We could look to see what the important features are:
bm1.imp <- investigate_var_importance(bm1, type="splits", plot=F)
## .....
bm1.imp <- bm1.imp[[1]]
rf1.imp <- rf1$importance/sum(rf1$importance)
xgb.imp <- xgb.importance(model=xgb1)
cart.imp <- cart1$variable.importance/sum(cart1$variable.importance)
imports <- xgb.imp[,c(1,4)]
names(imports)[2] <- "XGB"
imports <- cbind(imports, RF=rf1.imp[match(imports[[1]], rownames(rf1.imp))])
imports <- cbind(imports, CART=cart.imp[match(imports[[1]], names(cart.imp))])
imports <- cbind(imports, BART=bm1.imp[match(imports[[1]], names(bm1.imp))])
imports <- imports[,c(1,5,2,3,4)]
imports <- imports[order(rowMeans(imports[,-1]), decreasing = TRUE)]
print.data.frame(imports, digits=2)
## Feature BART XGB RF CART
## 1 parcomp 0.141 0.062 0.254 0.3844
## 2 gdppc_mp 0.118 0.196 0.185 0.2222
## 3 pop 0.186 0.253 0.165 0.1036
## 4 elec_prod_kwh_pc 0.161 0.184 0.164 0.1531
## 5 checks 0.098 0.102 0.052 0.0310
## 6 Year 0.090 0.115 0.053 0.0220
## 7 guerrilla 0.079 0.028 0.079 0.0773
## 8 terror_binary 0.060 0.028 0.035 0.0042
## 9 riots 0.067 0.031 0.012 0.0023
In general, it looks like the four features at the top of the table are the most important. We should make the various plots of interest for those variables. First, let’s look at the partial dependence plots.
library(pdp)
cart.pdp <- partial(cart1, data=train, pred.var="parcomp", pred.grid=data.frame(parcomp=1:5))
rf.pdp <- partial(rf1, data=train, pred.var="parcomp", pred.grid=data.frame(parcomp=1:5))
## partial doesn't work with xgboost, so we're cooking one up on our own
# make new data that is just the original data
newdf <- lapply(1:5, function(x)Xm)
# for each one of the new datasets, replace parcomp with a different value from 1 to 5
for(i in 1:5){newdf[[i]][,"parcomp"] <- i}
# generate all of the predictions
xgb.preds1 <- lapply(newdf, function(x)predict(xgb1, data=dtrain, newdata=x))
xgb.partial <- sapply(xgb.preds1, mean)
bart.preds1 <- lapply(newdf, function(x)predict(bm1, new_data=as.data.frame(x)))
bart.partial <- sapply(bart.preds1, mean)
plot.df <- data.frame(
preds = c(cart.pdp[,2], rf.pdp[,2], xgb.partial, bart.partial),
model = factor(rep(c("CART", "RF", "XGB", "BART"), each=5)),
parcomp = 1:5)
library(lattice)
xyplot(preds ~ parcomp, groups=model, data=plot.df, type="o", pch=16,
auto.key=list(space="top", columns=2, points=F, lines=T),
main = "PARCOMP PDP")
We could also do the same thing for the values of gdppc_mp. Here, instead of the values 1:5, we need to pick other values along the scale of gdppc_mp.
s <- quantile(q2data$gdppc_mp, probs=seq(.05,.95, by=.1))
cart.pdp <- partial(cart1, data=train, pred.var="gdppc_mp", pred.grid=data.frame(gdppc_mp = s))
rf.pdp <- partial(rf1, data=train, pred.var="gdppc_mp", pred.grid=data.frame(gdppc_mp=s))
## partial doesn't work with xgboost, so we're cooking one up on our own
# make new data that is just the original data
newdf <- lapply(1:length(s), function(x)Xm)
# for each one of the new datasets, replace gdppc_mp with a different quantile value
for(i in 1:length(s)){newdf[[i]][,"gdppc_mp"] <- s[i]}
# generate all of the predictions
xgb.preds2 <- lapply(newdf, function(x)predict(xgb1, data=dtrain, newdata=x))
xgb.partial <- sapply(xgb.preds2, mean)
bart.preds2 <- lapply(newdf, function(x)predict(bm1, new_data=as.data.frame(x)))
bart.partial <- sapply(bart.preds2, mean)
plot.df <- data.frame(
preds = c(cart.pdp[,2], rf.pdp[,2], xgb.partial, bart.partial),
model = factor(rep(c("CART", "RF", "XGB", "BART"), each=length(s))),
gdppc_mp = s)
xyplot(preds ~ gdppc_mp, groups=model, data=plot.df, type="o", pch=16,
auto.key=list(space="top", columns=2, points=F, lines=T),
main = "GDP/capita PDP")
We could also make the ICE plots, which will allow us to evaluate whether or not there is much in the way of any interactions.
library(RColorBrewer)
cols <- brewer.pal(5, "Set1")
library(ICEbox)
cart.ice <- ice(cart1, X=Xmdf, predictor="parcomp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
cice <- clusterICE(cart.ice, nClusters=1)
rf.ice <- ice(rf1, X=Xm, predictor="parcomp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
rice <- clusterICE(rf.ice, nClusters=5, plot_legend=T, colorvec=cols)
xgb.ice <- ice(xgb1, X=Xm, predictor="parcomp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
xice <- clusterICE(xgb.ice, nClusters=5, plot_legend=T, colorvec=cols)
bart.ice <- ice(bm1, X=Xmdf, predictor="parcomp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
bice <- clusterICE(bart.ice, nClusters=5, plot_legend=T, colorvec=cols)
And, then we could do the same thing for GDP/capita.
cart.ice <- ice(cart1, X=Xmdf, predictor="gdppc_mp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
cice <- clusterICE(cart.ice, nClusters=1)
rf.ice <- ice(rf1, X=Xm, predictor="gdppc_mp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
rice <- clusterICE(rf.ice, nClusters=5, plot_legend=T, colorvec=cols)
xgb.ice <- ice(xgb1, X=Xm, predictor="gdppc_mp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
xice <- clusterICE(xgb.ice, nClusters=5, plot_legend=T, colorvec=cols)
bart.ice <- ice(bm1, X=Xmdf, predictor="gdppc_mp", num_grid_pts=10)
## ..........
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
bice <- clusterICE(bart.ice, nClusters=5, plot_legend=T, colorvec=cols)
There is some sense from looking at the plots that non-linearity and interactions could be a problem.
Next, use the diagFun function that we talked about in class to diagnose problems with your parametric model. What, if any, problems exist?
source("https://quantoid.net/files/reg3/diagfun.r")
mod <- update(mod, data=dat66)
out <- diagFun(mod, data=dat66, splitSample=T, returnMods=T, partialFirst="X_res")
out$r2
## Orig polywogInt marsInt polywogNL marsNL marsFull
## 0.7783776 0.7665614 0.8079523 0.8880682 0.9026894 0.9069571
## polywogFull
## 0.8345241
print(out$coefs)
## polywogInt
## Estimate Std. Error t value Pr(>|t|)
## gdppc_mp -4.3774e+10 1.6228e+10 -2.6974 0.0070992 **
## riots*Year 1.0385e-02 3.0662e-03 3.3869 0.0007328 ***
## pop*Year 4.3736e-08 2.0875e-08 2.0951 0.0363978 *
## pop -2.7484e+09 9.4408e+08 -2.9112 0.0036754 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## marsInt
## Estimate Std. Error t value Pr(>|t|)
## parcomp*pop 8.0084e-07 1.2536e-07 6.3883 2.508e-10 ***
## gdppc_mp*elec_prod_kwh_pc 9.0367e-11 2.1241e-11 4.2544 2.281e-05 ***
## guerrilla*pop -1.5916e-06 2.4946e-07 -6.3802 2.640e-10 ***
## checks*parcomp -3.4812e-02 1.1005e-02 -3.1632 0.001605 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## polywogNL
## Estimate Std. Error t value Pr(>|t|)
## checks^2 7.6315e-03 2.6918e-03 2.8351 0.004669 **
## parcomp^2 5.6852e-01 1.1660e-01 4.8760 1.250e-06 ***
## guerrilla^2 -3.2512e-01 7.2101e-02 -4.5092 7.237e-06 ***
## riots^2 -4.2124e-03 1.6793e-03 -2.5085 0.012275 *
## gdppc_mp^2 1.8667e-09 6.0372e-10 3.0921 0.002040 **
## pop^2 -1.8925e-11 2.1341e-12 -8.8676 < 2.2e-16 ***
## parcomp^3 -7.5522e-02 1.3008e-02 -5.8058 8.472e-09 ***
## gdppc_mp^3 -1.6576e-14 7.1413e-15 -2.3211 0.020470 *
## pop^3 8.8544e-18 1.3459e-18 6.5787 7.465e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## marsNL
## Estimate Std. Error t value Pr(>|t|)
## h(parcomp-4) -7.3546e-01 8.7479e-02 -8.4073 < 2.2e-16 ***
## h(pop-44927) -1.2166e-05 2.6716e-06 -4.5538 5.883e-06 ***
## h(guerrilla-1) -7.2548e-01 1.7643e-01 -4.1120 4.228e-05 ***
## h(checks-6) 1.2959e-01 4.5349e-02 2.8577 0.0043515 **
## h(gdppc_mp-9810) 4.8145e-05 1.4585e-05 3.3011 0.0009954 ***
## h(pop-4018) -1.2839e-04 2.3467e-05 -5.4711 5.585e-08 ***
## h(pop-165838) -3.2328e-06 1.3062e-06 -2.4749 0.0134832 *
## gdppc_mp -3.2503e+10 1.4076e+10 -2.3092 0.0211276 *
## pop -1.8648e+09 8.1913e+08 -2.2766 0.0230120 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## marsFULL
## Estimate Std. Error t value
## h(parcomp-4) -7.7318e-01 9.8800e-02 -7.8257
## h(44927-pop) -1.6727e-05 2.7538e-06 -6.0742
## h(guerrilla-1) -6.8257e-01 1.7516e-01 -3.8968
## h(checks-6) 1.2611e-01 4.5185e-02 2.7909
## h(9810-gdppc_mp) 4.3843e-05 1.4539e-05 3.0155
## h(3802-pop)*h(98928-elec_prod_kwh_pc) -2.1349e-09 3.3800e-10 -6.3164
## h(pop-4538)*h(elec_prod_kwh_pc-98928) 5.1225e-11 1.8941e-11 2.7044
## h(4538-pop)*h(elec_prod_kwh_pc-98928) -9.7277e-10 3.9824e-10 -2.4427
## h(parcomp-4)*h(pop-2816) -5.5002e-06 2.7478e-06 -2.0017
## h(parcomp-4)*h(2816-pop) 2.5382e-04 1.0031e-04 2.5303
## Pr(>|t|)
## h(parcomp-4) 1.226e-14 ***
## h(44927-pop) 1.739e-09 ***
## h(guerrilla-1) 0.0001037 ***
## h(checks-6) 0.0053513 **
## h(9810-gdppc_mp) 0.0026273 **
## h(3802-pop)*h(98928-elec_prod_kwh_pc) 3.944e-10 ***
## h(pop-4538)*h(elec_prod_kwh_pc-98928) 0.0069532 **
## h(4538-pop)*h(elec_prod_kwh_pc-98928) 0.0147417 *
## h(parcomp-4)*h(pop-2816) 0.0455754 *
## h(parcomp-4)*h(2816-pop) 0.0115411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## polywogFull
## Estimate Std. Error t value Pr(>|t|)
## parcomp*elec_prod_kwh_pc -3.8969e-04 8.5749e-05 -4.5446 6.149e-06
## riots*Year 1.4667e-02 4.5859e-03 3.1983 0.001424
## gdppc_mp*elec_prod_kwh_pc 3.4389e-08 1.1006e-08 3.1246 0.001830
## parcomp*elec_prod_kwh_pc*Year 1.9448e-07 4.2957e-08 4.5273 6.663e-06
## guerrilla*pop*Year -8.7351e-10 1.2206e-10 -7.1564 1.561e-12
## gdppc_mp*elec_prod_kwh_pc*Year -1.7135e-11 5.5014e-12 -3.1147 0.001892
## pop^2*Year -3.8968e-15 3.0255e-16 -12.8802 < 2.2e-16
##
## parcomp*elec_prod_kwh_pc ***
## riots*Year **
## gdppc_mp*elec_prod_kwh_pc **
## parcomp*elec_prod_kwh_pc*Year ***
## guerrilla*pop*Year ***
## gdppc_mp*elec_prod_kwh_pc*Year **
## pop^2*Year ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It looks like the marsNL model is probably the right mix of flexibility and explanatory parsimony. Because there are no interactions, the plotmo function should prove useful for identifying the nature of the non-linearities.
plotmo(out$marsNL, pt.col=1)
## plotmo grid: checks parcomp terror_binary guerrilla riots gdppc_mp
## 3 3 1 0 0 1540
## pop elec_prod_kwh_pc Year
## 8980 21258 1995
There are a few big outliers on the checks variable, which seem to be driving the non-linearity in that variable. The parcomp variable should probably be dummied out and represented with 4 regressors; that’s something we might have done from the beginning anyway. It appears that the main variation on guerrilla is between 0 and non-0, so that could be dichotomized. It looks, from the plot, like gdppc_mp and pop could both be logged. Let’s look again at what would have happened if we had specified these ahead of time.
mod2 <- lm(fariss_repress ~ log(checks) + as.factor(parcomp) + terror_binary + I(guerrilla > 0) + riots +
poly(gdppc_mp,3) + log(pop) + elec_prod_kwh_pc + Year, data=dat66)
out <- diagFun(mod2, data=dat66, splitSample=T, returnMods=T, partialFirst="X_res")
out$r2
## Orig polywogInt marsInt polywogNL marsNL marsFull
## 0.9067413 0.8106342 0.7726723 0.8876238 0.9126733 0.9173652
## polywogFull
## 0.8294572
Specifying the model this way indicates that there is very little left over for the non-parametric model to do.
cart.all <- rpart(fariss_repress ~ ., data=train[,-c(1,3,4)])
rf.all <- randomForest(fariss_repress ~ ., data=train[,-c(1,3,4)])
X.all <- model.matrix(fariss_repress ~ ., data=train[,-c(1,3,4)])
X.all.test <- model.matrix(fariss_repress ~ ., data=test[,-c(1,3,4)])
y.all <- model.response(model.frame(fariss_repress ~ ., data=train[,-c(1,3,4)]))
#xgb.all <- xgb.cv(data=X.all, label=y.all, watchlist=list(data=X.all.test), nfold=10, nrounds=100)
xgb.all <- xgboost(data=X.all, label=y.all, nrounds=16)
## [1] train-rmse:1.096257
## [2] train-rmse:0.834676
## [3] train-rmse:0.650873
## [4] train-rmse:0.511497
## [5] train-rmse:0.414151
## [6] train-rmse:0.349241
## [7] train-rmse:0.299354
## [8] train-rmse:0.265453
## [9] train-rmse:0.240429
## [10] train-rmse:0.224391
## [11] train-rmse:0.202586
## [12] train-rmse:0.184972
## [13] train-rmse:0.169743
## [14] train-rmse:0.157192
## [15] train-rmse:0.148472
## [16] train-rmse:0.138863
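The choice of 16 rounds here is illustrative; a more principled way to pick the number of boosting rounds is cross-validation with early stopping, along the lines of the commented-out xgb.cv call above. A sketch, assuming the early_stopping_rounds argument is available in the installed version of the xgboost package:

```r
# Sketch: choose nrounds by 10-fold CV with early stopping
# (uses X.all and y.all from above)
cv <- xgb.cv(data = X.all, label = y.all, nfold = 10, nrounds = 100,
             early_stopping_rounds = 5, verbose = 0)
cv$best_iteration  # round that minimizes CV test RMSE
```

Training beyond the CV-chosen round would keep driving the train-RMSE down, as in the log above, but would start to hurt out-of-sample performance.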
bart.all <- bartMachine(X=as.data.frame(X.all), y.all, k=2, nu=10, q=.75, num_trees=200)
## bartMachine initializing with 200 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 90 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## evaluating in sample data...done
bart.imp <- investigate_var_importance(bart.all, type="splits", plot=F)
## .....
bart.imp <- bart.imp[[1]]
rf.imp <- rf.all$importance/sum(rf.all$importance)
xgb.imp <- xgb.importance(model=xgb.all)
cart.imp <- cart.all$variable.importance/sum(cart.all$variable.importance)
imports <- xgb.imp[,c(1,4)]
names(imports)[2] <- "XGB"
imports <- cbind(imports, RF=rf.imp[match(imports[[1]], rownames(rf.imp))])
imports <- cbind(imports, CART=cart.imp[match(imports[[1]], names(cart.imp))])
imports <- cbind(imports, BART=bart.imp[match(imports[[1]], names(bart.imp))])
imports <- imports[,c(1,5,2,3,4)]
imports <- imports[order(rowMeans(imports[,-1]), decreasing = TRUE)]
head(imports, n=10)
## Feature BART XGB RF CART
## 1: polity2 0.019499217 0.029370629 0.155350051 0.169758103
## 2: parcomp 0.028937106 0.005594406 0.124756251 0.146501797
## 3: infant_mort 0.019856748 0.016783217 0.059905870 0.087056231
## 4: conflict_binary 0.020693558 0.006993007 0.085212582 0.064700001
## 5: phones_no_cell_pc 0.025752703 0.009790210 0.025125823 0.099098788
## 6: gdppc_fc 0.011456819 0.015384615 0.012951857 0.081516353
## 7: gdppc_mp 0.006461439 0.012587413 0.011777230 0.082482899
## 8: Year 0.021559702 0.071328671 0.007856007 0.006434407
## 9: primsch_enroll 0.016311200 0.015384615 0.038905061 0.027442217
## 10: conflictindex 0.019071514 0.019580420 0.032491644 0.018185946
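Because all four importance measures are normalized to sum to one, they can be compared directly. A quick way to see where the methods agree is a grouped barplot; this sketch assumes imports is still a data.table as returned by xgb.importance:

```r
# Sketch: compare the four normalized importance measures for the
# ten most important features (ordered by row means, as above)
top10 <- head(imports, n = 10)
barplot(t(as.matrix(top10[, -1, with = FALSE])), beside = TRUE,
        names.arg = top10[[1]], las = 2, cex.names = 0.7,
        legend.text = colnames(top10)[-1],
        main = "Variable importance by method")
```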
It seems that some of the features I neglected are important for predicting the outcome. Here is what the in- and out-of-sample predictive capacity looks like.
cart.testa <- predict(cart.all, newdata=test)
cart.traina <- predict(cart.all, newdata=train)
rf.testa <- predict(rf.all, newdata=test)
rf.traina <- predict(rf.all, newdata=train)
xgb.testa <- predict(xgb.all, newdata=X.all.test)
xgb.traina <- predict(xgb.all, newdata=X.all)
bm1.testa <- predict(bart.all, new_data=as.data.frame(X.all.test))
bm1.traina <- bart.all$y_hat_train
iscora <- cor(cbind(bm1.traina, xgb.traina, rf.traina, cart.traina), y)^2
oscora <- cor(cbind(bm1.testa, xgb.testa, rf.testa, cart.testa), ytest)^2
corsa <- cbind( iscora, oscora)
colnames(corsa) <- c("In", "Out")
rownames(corsa) <- c("BART", "XGB", "RF", "CART")
corsa
## In Out
## BART 0.8863910 0.9060813
## XGB 0.9886118 0.8927250
## RF 0.9886008 0.9155704
## CART 0.7840297 0.7389020
Compare this to the in- and out-of-sample predictions from the models with just our selected few variables in them:
cors
## In Out
## BART 0.8863910 0.8033032
## XGB 0.9907074 0.8570469
## RF 0.9691669 0.8470609
## CART 0.7166293 0.6759275
What we get is some additional out-of-sample predictive capacity, for all four methods, when we cast our net a bit wider.
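The size of that gain is easiest to see by differencing the two out-of-sample columns (cors and corsa were built above with the same row order):

```r
# Out-of-sample R^2 gain from the wider feature set, by method
corsa[, "Out"] - cors[, "Out"]
##      BART       XGB        RF      CART
## 0.1027781 0.0356781 0.0685095 0.0629745
```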