Both of the questions in this lab use state repression as the dependent variable. In a general sense, state repression is the violation of human rights by the state. In this case, the focus is on the set of “physical integrity rights” - the rights to be free from torture, political imprisonment, extrajudicial killing and forced disappearance.

Question 1

This question uses the q1data.rda file (which will put an object in your workspace called q1data). You can get this file either by downloading it from the course website https://quantoid.net/teachicpsr/regression3 or by doing the following in R:

q1 <- file("https://quantoid.net/files/reg3/q1data.rda")
load(q1)
close(q1)

This file has a number of variables. You can find a short description of each with:

searchVarLabels(q1data, "")
##                  ind                                             label
## ccode              1                                          COW code
## Year               2                                              Year
## polity2            3                                           Polity2
## pop                4                                        Population
## all_phones_pc      5    All Telephones, including Cellular, Per Capita
## gdppc_mp           6 Gross National Product Per Capita (Market Prices)
## revols             7                                       Revolutions
## riots              8                                             Riots
## agdems             9                    Anti-Government Demonstrations
## terror_incidents  10                     Number of terrorist incidents
## physint           11                     CIRI Physical Integrity Index

The dependent variable is going to be physint, the CIRI physical integrity rights index. All of the other variables, except for ccode and Year, will be the independent variables.

  1. Use alsosDV in the DAMisc package to figure out whether the 9-category (0-8) variable could be treated as an interval-level variable and modeled with OLS without other modifications.
library(DAMisc)
mod <- alsosDV(physint ~ polity2 + pop + all_phones_pc + gdppc_mp + 
        revols + riots + agdems + terror_incidents, data=q1data)

Then, we could make a plot of the measurement function:

plot(mod$result)

[Plot: ALSOS measurement function (optimal scaling) for physint]

This looks as though we could use physint as an interval-level variable. For those of you in the scaling class, this should not be surprising: Cingranelli and Richards motivate the creation of the physical integrity rights index from the perspective of a cumulative scale, and that model fits the data relatively well.

  2. Once you’ve done that, diagnose problems with non-linearity using the conventional methods we talked about early last week (e.g., CR plots). Implement simple fixes to the problems if they exist.
library(car)
mod <- lm(physint ~ polity2 + pop + all_phones_pc + gdppc_mp + 
        revols + riots + agdems + terror_incidents, data=q1data)
crPlots(model=mod, layout=c(3,3))

[Plot: component-plus-residual plots for mod]

I’ll make the following changes to the model and then re-evaluate:

mod2 <- lm(physint ~ poly(polity2, 2) + log(pop) + 
          log(all_phones_pc) + log(gdppc_mp) + 
        revols + riots + agdems + terror_incidents, data=q1data)
summary(mod2)
## 
## Call:
## lm(formula = physint ~ poly(polity2, 2) + log(pop) + log(all_phones_pc) + 
##     log(gdppc_mp) + revols + riots + agdems + terror_incidents, 
##     data = q1data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0758 -0.9380  0.0953  0.9651  5.7792 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.172874   0.254910  28.139   <2e-16 ***
## poly(polity2, 2)1  50.041475   1.871044  26.745   <2e-16 ***
## poly(polity2, 2)2  20.841790   1.769869  11.776   <2e-16 ***
## log(pop)           -0.362858   0.020563 -17.646   <2e-16 ***
## log(all_phones_pc) -0.255177   0.027557  -9.260   <2e-16 ***
## log(gdppc_mp)       0.548943   0.036473  15.051   <2e-16 ***
## revols             -0.723978   0.059539 -12.160   <2e-16 ***
## riots              -0.005719   0.023139  -0.247    0.805    
## agdems             -0.025929   0.020954  -1.237    0.216    
## terror_incidents   -0.171783   0.008383 -20.491   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.494 on 3177 degrees of freedom
## Multiple R-squared:  0.5579, Adjusted R-squared:  0.5566 
## F-statistic: 445.4 on 9 and 3177 DF,  p-value: < 2.2e-16
crPlots(mod2, layout=c(3,3))

[Plot: component-plus-residual plots for mod2]

I think the only other modification worth making might be to dichotomize revolutions (or at least recode the two high values as threes). Looking at the CR plot, it appears that the real decrease comes in moving from zero to one revolution, and further increases produce roughly no change in the DV. If we want to look at transformations more formally, we could use the boxTidwell function. If you do this with all three of the variables that look amenable to a non-linear transformation (pop, all_phones_pc, gdppc_mp), you’ll notice that the model fails:

boxTidwell(physint ~ gdppc_mp + pop + all_phones_pc, 
           ~  poly(polity2, 2) + 
             I(revols > 0) + riots + agdems + terror_incidents,
           data=q1data)
## Error in lm.fit(cbind(1, x1.p, x2), y, ...): NA/NaN/Inf in 'x'

With one additional argument, you can see where things go wrong.

boxTidwell(physint ~ all_phones_pc + gdppc_mp + pop , 
           ~  poly(polity2, 2) + 
             I(revols > 0) + riots + agdems + terror_incidents,
           data=q1data, verbose=T)
##  iter = 1     powers = 2.527061 1.695795 0.2725716 
##  iter = 2     powers = 1.733256 -1.113484 0.05912212 
##  iter = 3     powers = -1.921426 0.5651 0.05605113 
##  iter = 4     powers = 13.97179 0.3848521 0.004292227 
##  iter = 5     powers = -37.23096 0.7811189 0.0254453 
##  iter = 6     powers = -37.10014 0.2974009 0.02349408 
##  iter = 7     powers = -13.05648 0.67927 0.0211009 
##  iter = 8     powers = 2827.259 0.4281952 0.02491519
## Error in lm.fit(cbind(1, x1.p, x2), y, ...): NA/NaN/Inf in 'x'

The algorithm wants to send the power on all_phones_pc to extreme values, which produces infinite values in the transformed variable (hence the NA/NaN/Inf error from lm.fit). Taking out all_phones_pc highlights a different problem.

boxTidwell(physint ~ gdppc_mp + pop , 
           ~  poly(polity2, 2) + all_phones_pc + 
             I(revols > 0) + riots + agdems + terror_incidents,
           data=q1data, verbose=T)
##  iter = 1     powers = 0.7997319 -139.8122
## Error in lm.fit(cbind(1, x.log.x, x1.p, x2), y, ...): NA/NaN/Inf in 'x'

The gdppc_mp variable also gets an unusually large negative power, which essentially makes all values of the transformed variable zero. A little trial and error shows that all_phones_pc can actually go back into the model, so long as gdppc_mp is not there.

boxTidwell(physint ~ all_phones_pc +  pop , 
           ~  gdppc_mp + poly(polity2, 2) + 
             I(revols > 0) + riots + agdems + terror_incidents,
           data=q1data)
##               MLE of lambda Score Statistic (z)  Pr(>|z|)    
## all_phones_pc      3.488003             -4.5974 4.278e-06 ***
## pop                0.046282             10.2349 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## iterations =  12
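
For reference, the recodes of revols mentioned above are simple to construct. This is just a sketch (revols01 and revols_cap are names I am inventing here; the models below simply use I(revols > 0) inline):

q1data$revols01 <- as.numeric(q1data$revols > 0)  # dichotomize at zero
q1data$revols_cap <- pmin(q1data$revols, 3)       # or cap the high values at three
table(q1data$revols, q1data$revols_cap)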

The estimated power for pop is essentially zero, which suggests a log, and the power for all_phones_pc is about 3.5. Carrying those forward, we could include polynomials in polity2 and gdppc_mp to capture their assumed functional forms.

mod3 <- lm(physint ~ poly(polity2, 2) + log(pop) + 
          I(all_phones_pc^3.5) + poly(gdppc_mp, 3) + 
          I(revols>0) + riots + agdems + terror_incidents,   
          data=q1data)
summary(mod3)
## 
## Call:
## lm(formula = physint ~ poly(polity2, 2) + log(pop) + I(all_phones_pc^3.5) + 
##     poly(gdppc_mp, 3) + I(revols > 0) + riots + agdems + terror_incidents, 
##     data = q1data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6069 -0.9328  0.0401  0.9437  4.6749 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.215e+00  1.817e-01  50.707  < 2e-16 ***
## poly(polity2, 2)1     4.234e+01  1.711e+00  24.746  < 2e-16 ***
## poly(polity2, 2)2     1.944e+01  1.723e+00  11.278  < 2e-16 ***
## log(pop)             -3.965e-01  2.034e-02 -19.490  < 2e-16 ***
## I(all_phones_pc^3.5) -7.064e-20  1.082e-20  -6.528 7.75e-11 ***
## poly(gdppc_mp, 3)1    3.719e+01  2.284e+00  16.280  < 2e-16 ***
## poly(gdppc_mp, 3)2   -1.343e+01  1.589e+00  -8.452  < 2e-16 ***
## poly(gdppc_mp, 3)3    2.082e+00  1.529e+00   1.362    0.173    
## I(revols > 0)TRUE    -1.052e+00  8.076e-02 -13.027  < 2e-16 ***
## riots                 6.914e-03  2.262e-02   0.306    0.760    
## agdems               -2.598e-02  2.052e-02  -1.266    0.206    
## terror_incidents     -1.607e-01  8.329e-03 -19.290  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.467 on 3175 degrees of freedom
## Multiple R-squared:  0.5737, Adjusted R-squared:  0.5722 
## F-statistic: 388.4 on 11 and 3175 DF,  p-value: < 2.2e-16

The Clarke test would show us that the model with the transformations and polynomials is better:

library(games)
clarke(mod2, mod3)
## 
## Clarke test for non-nested models
## 
## Model 1 log-likelihood: -5796
## Model 2 log-likelihood: -5738
## Observations: 3187
## Test statistic: 1303 (41%)
## 
## Model 2 is preferred (p < 2e-16)
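
As a quick cross-check on the Clarke test, we could also compare information criteria. Since mod2 and mod3 use the same variables (just transformed differently), they should be fit to the same observations, which makes the comparison straightforward. A sketch:

AIC(mod2, mod3)
BIC(mod2, mod3)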

For the polynomial terms, we might want to know whether we got the right degrees of freedom.

NKnots(physint ~ polity2 + log(pop) + 
          I(all_phones_pc^3.5) + poly(gdppc_mp, 3) + 
          I(revols>0) + riots + agdems + terror_incidents,   max.knots=6, 
          data=q1data, var="polity2", plot=T, criterion="BIC", includePoly=T)

[Plot: NKnots BIC comparison for polity2]

library(splines)
NKnots(physint ~ gdppc_mp + log(pop) + 
          I(all_phones_pc^3.5) + bs(polity2, df=4) + 
          I(revols>0) + riots + agdems + terror_incidents,   max.knots=6, 
          data=q1data, var="gdppc_mp", plot=T, criterion="BIC", includePoly=T)

[Plot: NKnots BIC comparison for gdppc_mp]

Note that we use the BIC to try to identify parsimonious models. The 2-df solution for gdppc_mp is almost the same as the 6-df solution, so we use the more parsimonious specification. This would suggest the following model:

mod4 <- lm(physint ~ bs(polity2, df=4) + log(pop) + 
          I(all_phones_pc^3.5) + poly(gdppc_mp, 2) + 
          I(revols>0) + riots + agdems + terror_incidents,   
          data=q1data)
summary(mod4)
## 
## Call:
## lm(formula = physint ~ bs(polity2, df = 4) + log(pop) + I(all_phones_pc^3.5) + 
##     poly(gdppc_mp, 2) + I(revols > 0) + riots + agdems + terror_incidents, 
##     data = q1data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4783 -0.9195 -0.0027  0.9352  4.9461 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.474e+00  2.010e-01  42.150  < 2e-16 ***
## bs(polity2, df = 4)1 -3.087e-01  3.161e-01  -0.977    0.329    
## bs(polity2, df = 4)2  7.650e-02  2.334e-01   0.328    0.743    
## bs(polity2, df = 4)3  8.887e-01  1.827e-01   4.863 1.21e-06 ***
## bs(polity2, df = 4)4  2.338e+00  1.366e-01  17.110  < 2e-16 ***
## log(pop)             -4.036e-01  2.026e-02 -19.918  < 2e-16 ***
## I(all_phones_pc^3.5) -6.714e-20  1.070e-20  -6.274 3.99e-10 ***
## poly(gdppc_mp, 2)1    2.959e+01  2.439e+00  12.133  < 2e-16 ***
## poly(gdppc_mp, 2)2   -1.078e+01  1.592e+00  -6.772 1.51e-11 ***
## I(revols > 0)TRUE    -1.057e+00  7.973e-02 -13.252  < 2e-16 ***
## riots                 7.549e-03  2.237e-02   0.337    0.736    
## agdems               -1.469e-02  2.031e-02  -0.723    0.469    
## terror_incidents     -1.536e-01  8.259e-03 -18.592  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.451 on 3174 degrees of freedom
## Multiple R-squared:  0.5831, Adjusted R-squared:  0.5816 
## F-statistic:   370 on 12 and 3174 DF,  p-value: < 2.2e-16
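
The 2-df versus 6-df judgment for gdppc_mp can also be checked directly with BIC. A sketch (mod4_cubic is a name I am introducing here, not something used elsewhere):

mod4_cubic <- update(mod4, . ~ . - poly(gdppc_mp, 2) + poly(gdppc_mp, 3))
BIC(mod4, mod4_cubic)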

We could plot the variable effects out with:

library(effects)
plot(allEffects(mod4, xlevels=25))

[Plot: effect displays for mod4]

Since the ALSOS result depends on the right-hand-side specification of the model, we could run it again with the new specification just to make sure the interval-level treatment still looks reasonable:

mod4a <- alsosDV(physint ~ bs(polity2, df=4) + log(pop) + 
          I(all_phones_pc^3.5) + poly(gdppc_mp, 2) + 
          I(revols>0) + riots + agdems + terror_incidents,   
          data=q1data)
plot(mod4a$result)

[Plot: ALSOS measurement function for physint under the mod4 specification]

Now, if we wanted a relatively simple test, we could use a GAM with smoothers on all of the transformed variables and test against the alternative.

library(mgcv)
mod4b <- gam(physint ~ s(polity2, bs="cs") + 
           s(log(pop), bs="cs") + s(all_phones_pc, bs="cs") +
           s(gdppc_mp, bs="cs") + I(revols > 0) + riots +
           agdems + terror_incidents, data=q1data)
anova(mod4, mod4b, test="F")
## Analysis of Variance Table
## 
## Model 1: physint ~ bs(polity2, df = 4) + log(pop) + I(all_phones_pc^3.5) + 
##     poly(gdppc_mp, 2) + I(revols > 0) + riots + agdems + terror_incidents
## Model 2: physint ~ s(polity2, bs = "cs") + s(log(pop), bs = "cs") + s(all_phones_pc, 
##     bs = "cs") + s(gdppc_mp, bs = "cs") + I(revols > 0) + riots + 
##     agdems + terror_incidents
##   Res.Df    RSS    Df Sum of Sq      F    Pr(>F)    
## 1 3174.0 6683.3                                     
## 2 3155.6 6486.7 18.36     196.6 5.2093 3.083e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(mod4b, pages=1)

[Plot: estimated smooth terms from mod4b]

This suggests that the parametric model we estimated was not quite flexible enough to fit all of the nuances of the data.

This isn’t required to answer the question, but some of you have asked about time-series cross-sectional data. If you don’t want to estimate the model with a bunch of extra dummy variables in it, you could apply the within transformation to all of the variables and then estimate models on the within-transformed data to get the fixed-effects estimator.

cc <- q1data[complete.cases(q1data[,c("physint", "polity2", "pop", "all_phones_pc", "gdppc_mp", 
            "revols", "riots", "agdems", "terror_incidents")]), ]
X <- model.matrix(~ physint + polity2 + log(pop) + all_phones_pc + gdppc_mp + 
        I(revols > 0) + agdems + terror_incidents + riots, data=cc)
Xw <- lm(X ~ as.factor(ccode), data=cc)$residuals
Xw <- as.data.frame(Xw[,-1])
Xw$year <- cc$Year  
names(Xw)[6] <- "revols"
names(Xw)[3] <- "pop"
mod4c <- gam(physint ~ s(polity2, bs="ts") + 
           s(pop, bs="ts") + s(all_phones_pc, bs="ts") +
           s(gdppc_mp, bs="ts") + revols + riots +
           agdems + terror_incidents + s(year, bs="ts"), data=Xw)
plot(mod4c, pages=1)

[Plot: estimated smooth terms from mod4c (within-transformed data)]
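
As a sanity check on the within transformation itself, an OLS fit on the within-transformed data should reproduce the slope coefficients from the least-squares dummy variable (LSDV) fit exactly (the Frisch-Waugh-Lovell result). A sketch using the cc and Xw objects created above (lsdv and wthn are names I am making up):

lsdv <- lm(physint ~ polity2 + log(pop) + all_phones_pc + gdppc_mp + 
        I(revols > 0) + agdems + terror_incidents + riots + 
        as.factor(ccode), data=cc)
wthn <- lm(physint ~ . - year, data=Xw)  # Xw's pop column is the within-transformed log(pop)
cbind(LSDV=coef(lsdv)[2:9], Within=coef(wthn)[2:9])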

Question 2

We’re going to continue to investigate repression in this question with a different dependent variable and potentially many other independent variables. This question uses the q2data.rda file (which will put an object in your workspace called q2data). You can get this file either by downloading it from the course website https://quantoid.net/teachicpsr/regression3 or by doing the following in R:

q2 <- file("https://quantoid.net/files/reg3/q2data.rda")
load(q2)
close(q2)

There are too many variables to print out the list, but you can generate it with:

searchVarLabels(q2data, "")

I have organized the data so the country identifiers come first, the DV fariss_repress is next, then variables related to democracy and rights, then variables related to conflict, and finally variables related to other characteristics of the country. Before you start your investigation, I want you to take out 33% of your data and save it for later. You can do that as follows:

library(dplyr)  # for %>% and filter()
samps <- sample(1:nrow(q2data), floor(.33*nrow(q2data)), replace=FALSE)
keep <- setdiff(1:nrow(q2data), samps)
dat66 <- q2data %>% filter(1:nrow(q2data) %in% keep)
hold <- q2data %>% filter(1:nrow(q2data) %in% samps)
train33 <- sample(1:nrow(dat66), floor(.5*nrow(dat66)), replace=FALSE)
test33 <- setdiff(1:nrow(dat66), train33)
train <- dat66 %>% filter(1:nrow(dat66) %in% train33)
test <- dat66 %>% filter(1:nrow(dat66) %in% test33)

We’ll do the preliminary investigation on one half of the dat66 object and the preliminary testing on the other half. Only once we settle on a final model will we evaluate it on the hold data object.

  1. Much of the work on repression suggests that conflict increases demand for repression, while democratic institutions and behaviors tend to reduce it. Estimate a parametric model that captures this idea along with whatever controls you think might be important.

First, I’ll use the checks and parcomp variables for democracy. I’ll also use terror_binary, guerrilla and riots as the conflict variables. Finally, I’ll include gdppc_mp, pop, and elec_prod_kwh_pc as the other variables in the model.

mod <- lm(fariss_repress ~ checks + parcomp + terror_binary + guerrilla + riots + 
            gdppc_mp + pop + elec_prod_kwh_pc + Year, data=train)
summary(mod)
## 
## Call:
## lm(formula = fariss_repress ~ checks + parcomp + terror_binary + 
##     guerrilla + riots + gdppc_mp + pop + elec_prod_kwh_pc + Year, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.09642 -0.47263 -0.00339  0.50428  2.51319 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.053e+01  6.395e+00  -1.647 0.099858 .  
## checks           -3.739e-02  1.785e-02  -2.095 0.036445 *  
## parcomp          -3.196e-01  2.262e-02 -14.130  < 2e-16 ***
## terror_binary     6.150e-01  4.954e-02  12.416  < 2e-16 ***
## guerrilla         6.754e-01  6.249e-02  10.808  < 2e-16 ***
## riots             6.959e-02  1.964e-02   3.543 0.000412 ***
## gdppc_mp         -2.973e-05  3.858e-06  -7.706 2.97e-14 ***
## pop               9.003e-07  1.725e-07   5.219 2.16e-07 ***
## elec_prod_kwh_pc -2.962e-06  3.715e-07  -7.972 4.00e-15 ***
## Year              5.656e-03  3.212e-03   1.761 0.078547 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7651 on 1063 degrees of freedom
## Multiple R-squared:  0.6395, Adjusted R-squared:  0.6365 
## F-statistic: 209.5 on 9 and 1063 DF,  p-value: < 2.2e-16
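
Since the substantive claim is about the conflict variables as a group, a joint test is a natural summary here. A sketch using linearHypothesis from the car package (loaded back in Question 1):

linearHypothesis(mod, c("terror_binary = 0", "guerrilla = 0", "riots = 0"))
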
  2. Try some of the machine learning algorithms (CART, random forest, XGBoost, BART) and evaluate the PDPs and/or ICE plots. What do those plots say about the nature of the relationship between the independent variables and repression? Is there any evidence of an interaction?
library(rpart)
cart1 <- rpart(formula(mod), data=train)
plot(cart1)
text(cart1)

[Plot: CART tree for cart1]

cart.test <- predict(cart1, newdata=test)
cart.train <- predict(cart1, newdata=train)
cor(cart.test, test$fariss_repress)^2
## [1] 0.6759275
library(randomForest)
rf1 <- randomForest(formula(mod), data=train)
rf.test <- predict(rf1, newdata=test)
rf.train <- predict(rf1, newdata=train)
cor(rf.test, test$fariss_repress)^2
## [1] 0.8470609
library(xgboost)
Xm <- model.matrix(formula(mod), data=train)
Xmtest <- model.matrix(formula(mod), data=test)
y <- model.response(model.frame(formula(mod), data=train))
ytest <- model.response(model.frame(formula(mod), data=test))
dtrain <- xgb.DMatrix(data=Xm, info=list(label=y))
xgb1c <- xgb.cv(data=dtrain, nfold=5, ncross=25, nrounds=50)
xgb1 <- xgboost(data=dtrain, nrounds=50, params=list(objective="reg:linear"))
xgb.test <- predict(xgb1, newdata=Xmtest, nrounds=50)
xgb.train <- predict(xgb1, newdata=Xm, nrounds=50)

For the bartMachine analysis, I used bartMachineCV (which takes a reasonably long time); it identified the following set of parameters as those that minimize the out-of-sample error: k=2, nu=10, q=.75, num_trees=200. I implement that advice below.

library(bartMachine)
Xmdf <- as.data.frame(model.matrix(formula(mod), data=train))
Xmtestdf <- as.data.frame(model.matrix(formula(mod), data=test))
bm1 <- bartMachine(Xmdf, y, k=2, nu=10, q=.75, num_trees=200)
## bartMachine initializing with 200 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 11 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## evaluating in sample data...done
bm1.test <- predict(bm1, new_data=Xmtestdf)
bm1.train <- bm1$y_hat_train
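
For reference, the cross-validation step mentioned above is just the following (not re-run here because it is slow; it searches bartMachine's default grid of k, nu/q and num_trees values):

# bmcv <- bartMachineCV(Xmdf, y)
# bmcv  # reports the winning k, nu, q, num_trees combination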

Now that we’ve done all of those models, we can look at the in- and out-of-sample performance.

iscor <- cor(cbind(bm1.train, xgb.train, rf.train, cart.train), y)^2
oscor <- cor(cbind(bm1.test, xgb.test, rf.test, cart.test), ytest)^2
cors <- cbind( iscor, oscor)
colnames(cors) <- c("In", "Out")
rownames(cors) <- c("BART", "XGB", "RF", "CART")
cors
##             In       Out
## BART 0.8863910 0.8033032
## XGB  0.9907074 0.8570469
## RF   0.9691669 0.8470609
## CART 0.7166293 0.6759275
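
Squared correlations ignore calibration (an additive or multiplicative bias in the predictions would not show up), so it can also be worth looking at out-of-sample RMSE. A sketch:

rmse <- function(pred, obs)sqrt(mean((obs - pred)^2))
sapply(list(BART=bm1.test, XGB=xgb.test, RF=rf.test, CART=cart.test), 
       rmse, obs=ytest)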

We could look to see what the important features are:

bm1.imp <- investigate_var_importance(bm1, type="splits", plot=F)
## .....
bm1.imp <- bm1.imp[[1]]
rf1.imp <- rf1$importance/sum(rf1$importance)
xgb.imp <- xgb.importance(model=xgb1)
cart.imp <- cart1$variable.importance/sum(cart1$variable.importance)
imports <-  xgb.imp[,c(1,4)]
names(imports)[2] <- "XGB"
imports <- cbind(imports, RF=rf1.imp[match(imports[[1]], rownames(rf1.imp))])
imports <- cbind(imports, CART=cart.imp[match(imports[[1]], names(cart.imp))])
imports <- cbind(imports, BART=bm1.imp[match(imports[[1]], names(bm1.imp))])
imports <- imports[,c(1,5,2,3,4)]
imports <- imports[order(rowMeans(imports[,-1]), decreasing = TRUE)]
print.data.frame(imports, digits=2)
##            Feature  BART   XGB    RF   CART
## 1          parcomp 0.141 0.062 0.254 0.3844
## 2         gdppc_mp 0.118 0.196 0.185 0.2222
## 3              pop 0.186 0.253 0.165 0.1036
## 4 elec_prod_kwh_pc 0.161 0.184 0.164 0.1531
## 5           checks 0.098 0.102 0.052 0.0310
## 6             Year 0.090 0.115 0.053 0.0220
## 7        guerrilla 0.079 0.028 0.079 0.0773
## 8    terror_binary 0.060 0.028 0.035 0.0042
## 9            riots 0.067 0.031 0.012 0.0023

In general, it looks like the four features at the top of the table are the most important. We should make the various plots of interest for those variables. First, let’s look at the partial dependence plots.

library(pdp)
cart.pdp <- partial(cart1, data=train, pred.var="parcomp", pred.grid=data.frame(parcomp=1:5))
rf.pdp <- partial(rf1, data=train, pred.var="parcomp", pred.grid=data.frame(parcomp=1:5))
## partial doesn't work with xgboost, so we're cooking one up on our own
# make new data that is just the original data
newdf <- lapply(1:5, function(x)Xm)
# for each one of the new datasets, replace parcomp with a different value from 1 to 5
for(i in 1:5){newdf[[i]][,"parcomp"] <- i}
# generate all of the predictions
xgb.preds1 <- lapply(newdf, function(x)predict(xgb1, data=dtrain, newdata=x))
xgb.partial <- sapply(xgb.preds1, mean)
bart.preds1 <- lapply(newdf, function(x)predict(bm1, new_data=as.data.frame(x)))
bart.partial <- sapply(bart.preds1, mean)
plot.df <- data.frame(
  preds = c(cart.pdp[,2], rf.pdp[,2], xgb.partial, bart.partial), 
  model = factor(rep(c("CART", "RF", "XGB", "BART"), each=5)), 
  parcomp = 1:5)
library(lattice)
xyplot(preds ~ parcomp, groups=model, data=plot.df, type="o", pch=16, 
       auto.key=list(space="top", columns=2, points=F, lines=T), 
       main = "PARCOMP PDP")

[Plot: partial dependence of predictions on parcomp, by model]
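
The hand-rolled partial dependence calculation above could be wrapped in a small helper so the same logic can be reused for gdppc_mp below (manualPDP is a name I am making up; predfun is whatever prediction function the model in question needs):

manualPDP <- function(values, var, X, predfun){
  sapply(values, function(v){
    Xv <- X
    Xv[, var] <- v      # set the focal variable to v for every observation
    mean(predfun(Xv))   # average the predictions
  })
}
# e.g., manualPDP(1:5, "parcomp", Xm, function(x)predict(xgb1, newdata=x))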

We could also do the same thing for the values of gdppc_mp. Here, instead of the values 1:5, we need to pick other values along the scale of gdppc_mp.

s <- quantile(q2data$gdppc_mp, probs=seq(.05,.95, by=.1))
cart.pdp <- partial(cart1, data=train, pred.var="gdppc_mp", pred.grid=data.frame(gdppc_mp = s))
rf.pdp <- partial(rf1, data=train, pred.var="gdppc_mp", pred.grid=data.frame(gdppc_mp=s))
## partial doesn't work with xgboost, so we're cooking one up on our own
# make new data that is just the original data
newdf <- lapply(1:length(s), function(x)Xm)
# for each one of the new datasets, replace gdppc_mp with a different quantile value
for(i in 1:length(s)){newdf[[i]][,"gdppc_mp"] <- s[i]}
# generate all of the predictions
xgb.preds2 <- lapply(newdf, function(x)predict(xgb1, data=dtrain, newdata=x))
xgb.partial <- sapply(xgb.preds2, mean)
bart.preds2 <- lapply(newdf, function(x)predict(bm1, new_data=as.data.frame(x)))
bart.partial <- sapply(bart.preds2, mean)
plot.df <- data.frame(
  preds = c(cart.pdp[,2], rf.pdp[,2], xgb.partial, bart.partial), 
  model = factor(rep(c("CART", "RF", "XGB", "BART"), each=length(s))), 
  gdppc_mp = s)
xyplot(preds ~ gdppc_mp, groups=model, data=plot.df, type="o", pch=16, 
       auto.key=list(space="top", columns=2, points=F, lines=T), 
       main = "GDP/capita PDP")

[Plot: partial dependence of predictions on gdppc_mp, by model]

We could also make ICE plots, which allow us to evaluate whether there is much evidence of interactions.

library(RColorBrewer)
cols <- brewer.pal(5, "Set1")
library(ICEbox)
cart.ice <- ice(cart1, X=Xmdf, predictor="parcomp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
cice <- clusterICE(cart.ice, nClusters=1)

[Plot: clustered ICE curves for parcomp, CART]

rf.ice <- ice(rf1, X=Xm, predictor="parcomp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
rice <- clusterICE(rf.ice, nClusters=5, plot_legend=T, colorvec=cols)

[Plot: clustered ICE curves for parcomp, random forest]

xgb.ice <- ice(xgb1, X=Xm, predictor="parcomp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
xice <- clusterICE(xgb.ice, nClusters=5, plot_legend=T, colorvec=cols)

[Plot: clustered ICE curves for parcomp, XGBoost]

bart.ice <- ice(bm1, X=Xmdf, predictor="parcomp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
bice <- clusterICE(bart.ice, nClusters=5, plot_legend=T, colorvec=cols)

[Plot: clustered ICE curves for parcomp, BART]

And, then we could do the same thing for GDP/capita.

cart.ice <- ice(cart1, X=Xmdf, predictor="gdppc_mp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
cice <- clusterICE(cart.ice, nClusters=1)

[Plot: clustered ICE curves for gdppc_mp, CART]

rf.ice <- ice(rf1, X=Xm, predictor="gdppc_mp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
rice <- clusterICE(rf.ice, nClusters=5, plot_legend=T, colorvec=cols)

[Plot: clustered ICE curves for gdppc_mp, random forest]

xgb.ice <- ice(xgb1, X=Xm, predictor="gdppc_mp")
## .....
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
xice <- clusterICE(xgb.ice, nClusters=5, plot_legend=T, colorvec=cols)

[Plot: clustered ICE curves for gdppc_mp, XGBoost]

bart.ice <- ice(bm1, X=Xmdf, predictor="gdppc_mp", num_grid_pts=10)
## ..........
## y not passed, so range_y is range of ice curves and sd_y is sd of predictions on real observations
bice <- clusterICE(bart.ice, nClusters=5, plot_legend=T, colorvec=cols)

[Plot: clustered ICE curves for gdppc_mp, BART]

There is some sense from looking at the plots that non-linearity and interactions could be a problem.
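
If we want something a bit more formal than eyeballing the clustered ICE curves, bartMachine has a built-in interaction screen that counts how often pairs of variables appear together in the same trees. A sketch (see ?interaction_investigator for the available options):

bm.int <- interaction_investigator(bm1)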

  3. Use the diagFun function that we talked about in class to diagnose your parametric model. What problems, if any, exist?
source("https://quantoid.net/files/reg3/diagfun.r")
mod <- update(mod, data=dat66)
out <- diagFun(mod, data=dat66, splitSample=T, returnMods=T, partialFirst="X_res")

out$r2
##        Orig  polywogInt     marsInt   polywogNL      marsNL    marsFull 
##   0.7783776   0.7665614   0.8079523   0.8880682   0.9026894   0.9069571 
## polywogFull 
##   0.8345241
print(out$coefs)
## polywogInt 
##               Estimate  Std. Error t value  Pr(>|t|)    
## gdppc_mp   -4.3774e+10  1.6228e+10 -2.6974 0.0070992 ** 
## riots*Year  1.0385e-02  3.0662e-03  3.3869 0.0007328 ***
## pop*Year    4.3736e-08  2.0875e-08  2.0951 0.0363978 *  
## pop        -2.7484e+09  9.4408e+08 -2.9112 0.0036754 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## marsInt 
##                              Estimate  Std. Error t value  Pr(>|t|)    
## parcomp*pop                8.0084e-07  1.2536e-07  6.3883 2.508e-10 ***
## gdppc_mp*elec_prod_kwh_pc  9.0367e-11  2.1241e-11  4.2544 2.281e-05 ***
## guerrilla*pop             -1.5916e-06  2.4946e-07 -6.3802 2.640e-10 ***
## checks*parcomp            -3.4812e-02  1.1005e-02 -3.1632  0.001605 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## polywogNL 
##                Estimate  Std. Error t value  Pr(>|t|)    
## checks^2     7.6315e-03  2.6918e-03  2.8351  0.004669 ** 
## parcomp^2    5.6852e-01  1.1660e-01  4.8760 1.250e-06 ***
## guerrilla^2 -3.2512e-01  7.2101e-02 -4.5092 7.237e-06 ***
## riots^2     -4.2124e-03  1.6793e-03 -2.5085  0.012275 *  
## gdppc_mp^2   1.8667e-09  6.0372e-10  3.0921  0.002040 ** 
## pop^2       -1.8925e-11  2.1341e-12 -8.8676 < 2.2e-16 ***
## parcomp^3   -7.5522e-02  1.3008e-02 -5.8058 8.472e-09 ***
## gdppc_mp^3  -1.6576e-14  7.1413e-15 -2.3211  0.020470 *  
## pop^3        8.8544e-18  1.3459e-18  6.5787 7.465e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## marsNL 
##                     Estimate  Std. Error t value  Pr(>|t|)    
## h(parcomp-4)     -7.3546e-01  8.7479e-02 -8.4073 < 2.2e-16 ***
## h(pop-44927)     -1.2166e-05  2.6716e-06 -4.5538 5.883e-06 ***
## h(guerrilla-1)   -7.2548e-01  1.7643e-01 -4.1120 4.228e-05 ***
## h(checks-6)       1.2959e-01  4.5349e-02  2.8577 0.0043515 ** 
## h(gdppc_mp-9810)  4.8145e-05  1.4585e-05  3.3011 0.0009954 ***
## h(pop-4018)      -1.2839e-04  2.3467e-05 -5.4711 5.585e-08 ***
## h(pop-165838)    -3.2328e-06  1.3062e-06 -2.4749 0.0134832 *  
## gdppc_mp         -3.2503e+10  1.4076e+10 -2.3092 0.0211276 *  
## pop              -1.8648e+09  8.1913e+08 -2.2766 0.0230120 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## marsFULL 
##                                          Estimate  Std. Error t value
## h(parcomp-4)                          -7.7318e-01  9.8800e-02 -7.8257
## h(44927-pop)                          -1.6727e-05  2.7538e-06 -6.0742
## h(guerrilla-1)                        -6.8257e-01  1.7516e-01 -3.8968
## h(checks-6)                            1.2611e-01  4.5185e-02  2.7909
## h(9810-gdppc_mp)                       4.3843e-05  1.4539e-05  3.0155
## h(3802-pop)*h(98928-elec_prod_kwh_pc) -2.1349e-09  3.3800e-10 -6.3164
## h(pop-4538)*h(elec_prod_kwh_pc-98928)  5.1225e-11  1.8941e-11  2.7044
## h(4538-pop)*h(elec_prod_kwh_pc-98928) -9.7277e-10  3.9824e-10 -2.4427
## h(parcomp-4)*h(pop-2816)              -5.5002e-06  2.7478e-06 -2.0017
## h(parcomp-4)*h(2816-pop)               2.5382e-04  1.0031e-04  2.5303
##                                        Pr(>|t|)    
## h(parcomp-4)                          1.226e-14 ***
## h(44927-pop)                          1.739e-09 ***
## h(guerrilla-1)                        0.0001037 ***
## h(checks-6)                           0.0053513 ** 
## h(9810-gdppc_mp)                      0.0026273 ** 
## h(3802-pop)*h(98928-elec_prod_kwh_pc) 3.944e-10 ***
## h(pop-4538)*h(elec_prod_kwh_pc-98928) 0.0069532 ** 
## h(4538-pop)*h(elec_prod_kwh_pc-98928) 0.0147417 *  
## h(parcomp-4)*h(pop-2816)              0.0455754 *  
## h(parcomp-4)*h(2816-pop)              0.0115411 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## polywogFull 
##                                   Estimate  Std. Error  t value  Pr(>|t|)
## parcomp*elec_prod_kwh_pc       -3.8969e-04  8.5749e-05  -4.5446 6.149e-06
## riots*Year                      1.4667e-02  4.5859e-03   3.1983  0.001424
## gdppc_mp*elec_prod_kwh_pc       3.4389e-08  1.1006e-08   3.1246  0.001830
## parcomp*elec_prod_kwh_pc*Year   1.9448e-07  4.2957e-08   4.5273 6.663e-06
## guerrilla*pop*Year             -8.7351e-10  1.2206e-10  -7.1564 1.561e-12
## gdppc_mp*elec_prod_kwh_pc*Year -1.7135e-11  5.5014e-12  -3.1147  0.001892
## pop^2*Year                     -3.8968e-15  3.0255e-16 -12.8802 < 2.2e-16
##                                   
## parcomp*elec_prod_kwh_pc       ***
## riots*Year                     ** 
## gdppc_mp*elec_prod_kwh_pc      ** 
## parcomp*elec_prod_kwh_pc*Year  ***
## guerrilla*pop*Year             ***
## gdppc_mp*elec_prod_kwh_pc*Year ** 
## pop^2*Year                     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It looks like the marsNL model is probably the right mix of flexibility and explanatory parsimony. Because there are no interactions, the plotmo function should be useful for identifying the nature of the non-linearities.

library(plotmo)
plotmo(out$marsNL, pt.col=1)
##  plotmo grid:    checks parcomp terror_binary guerrilla riots gdppc_mp
##                       3       3             1         0     0     1540
##   pop elec_prod_kwh_pc Year
##  8980            21258 1995

[Plot: plotmo displays for the marsNL model]

There are a few big outliers on the checks variable, which seem to be driving the non-linearity in that variable. The parcomp variable should probably be dummied out and represented with four regressors; that is something we might have done from the beginning anyway. It appears that the main variation in guerrilla is between zero and non-zero, so that variable could be dichotomized. It looks, from the plot, like gdppc_mp and pop could both be logged. Let’s look again at what would have happened if we had specified these things ahead of time.

mod2 <- lm(fariss_repress ~ log(checks) + as.factor(parcomp) + terror_binary + I(guerrilla > 0) + riots + 
            poly(gdppc_mp,3) + log(pop) + elec_prod_kwh_pc + Year, data=dat66)
out <- diagFun(mod2, data=dat66, splitSample=T, returnMods=T, partialFirst="X_res")
out$r2
##        Orig  polywogInt     marsInt   polywogNL      marsNL    marsFull 
##   0.9067413   0.8106342   0.7726723   0.8876238   0.9126733   0.9173652 
## polywogFull 
##   0.8294572

Specifying the model this way indicates that there is very little left over for the non-parametric model to do.
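
We could also confirm directly that the re-specified parametric model beats the original one by reusing the Clarke test from Question 1 (a sketch; it assumes both models end up using the same observations from dat66):

clarke(mod, mod2)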

  4. Now, put all of the possible independent variables into the machine learning algorithms above. How do the results of these models compare to the parametric model and to the results from the machine learning models fit to the subset of variables?
cart.all <- rpart(fariss_repress ~ ., data=train[,-c(1,3,4)])
rf.all <- randomForest(fariss_repress ~ ., data=train[,-c(1,3,4)])
X.all <- model.matrix(fariss_repress ~ ., data=train[,-c(1,3,4)])
X.all.test <- model.matrix(fariss_repress ~ ., data=test[,-c(1,3,4)])
y.all <- model.response(model.frame(fariss_repress ~ ., data=train[,-c(1,3,4)]))
#xgb.all <- xgb.cv(data=X.all, label=y.all, watchlist=list(data=X.all.test), nfold=10, nrounds=100)
xgb.all <- xgboost(data=X.all, label=y.all, nrounds=16)
## [1]  train-rmse:1.096257 
## [2]  train-rmse:0.834676 
## [3]  train-rmse:0.650873 
## [4]  train-rmse:0.511497 
## [5]  train-rmse:0.414151 
## [6]  train-rmse:0.349241 
## [7]  train-rmse:0.299354 
## [8]  train-rmse:0.265453 
## [9]  train-rmse:0.240429 
## [10] train-rmse:0.224391 
## [11] train-rmse:0.202586 
## [12] train-rmse:0.184972 
## [13] train-rmse:0.169743 
## [14] train-rmse:0.157192 
## [15] train-rmse:0.148472 
## [16] train-rmse:0.138863
bart.all <- bartMachine(X=as.data.frame(X.all), y.all, k=2, nu=10, q=.75, num_trees=200)
## bartMachine initializing with 200 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 90 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## evaluating in sample data...done
bart.imp <- investigate_var_importance(bart.all, type="splits", plot=F)
## .....
bart.imp <- bart.imp[[1]]
rf.imp <- rf.all$importance/sum(rf.all$importance)
xgb.imp <- xgb.importance(model=xgb.all)
cart.imp <- cart.all$variable.importance/sum(cart.all$variable.importance)
imports <-  xgb.imp[,c(1,4)]
names(imports)[2] <- "XGB"
imports <- cbind(imports, RF=rf.imp[match(imports[[1]], rownames(rf.imp))])
imports <- cbind(imports, CART=cart.imp[match(imports[[1]], names(cart.imp))])
imports <- cbind(imports, BART=bart.imp[match(imports[[1]], names(bart.imp))])
imports <- imports[,c(1,5,2,3,4)]
imports <- imports[order(rowMeans(imports[,-1]), decreasing = TRUE)]
head(imports, n=10)
##               Feature        BART         XGB          RF        CART
##  1:           polity2 0.019499217 0.029370629 0.155350051 0.169758103
##  2:           parcomp 0.028937106 0.005594406 0.124756251 0.146501797
##  3:       infant_mort 0.019856748 0.016783217 0.059905870 0.087056231
##  4:   conflict_binary 0.020693558 0.006993007 0.085212582 0.064700001
##  5: phones_no_cell_pc 0.025752703 0.009790210 0.025125823 0.099098788
##  6:          gdppc_fc 0.011456819 0.015384615 0.012951857 0.081516353
##  7:          gdppc_mp 0.006461439 0.012587413 0.011777230 0.082482899
##  8:              Year 0.021559702 0.071328671 0.007856007 0.006434407
##  9:    primsch_enroll 0.016311200 0.015384615 0.038905061 0.027442217
## 10:     conflictindex 0.019071514 0.019580420 0.032491644 0.018185946

It seems that some of the features I neglected are important for predicting the outcome. Here’s what the in- and out-of-sample predictive capacity looks like.

cart.testa <- predict(cart.all, newdata=test)
cart.traina <- predict(cart.all, newdata=train)
rf.testa <- predict(rf.all, newdata=test)
rf.traina <- predict(rf.all, newdata=train)
xgb.testa <- predict(xgb.all, newdata=X.all.test, nrounds=50)
xgb.traina <- predict(xgb.all, newdata=X.all, nrounds=50)
bm1.testa <- predict(bart.all, new_data=as.data.frame(X.all.test))
bm1.traina <- bart.all$y_hat_train  # in-sample fits from the all-variable BART


iscora <- cor(cbind(bm1.traina, xgb.traina, rf.traina, cart.traina), y)^2
oscora <- cor(cbind(bm1.testa, xgb.testa, rf.testa, cart.testa), ytest)^2
corsa <- cbind( iscora, oscora)
colnames(corsa) <- c("In", "Out")
rownames(corsa) <- c("BART", "XGB", "RF", "CART")
corsa
##             In       Out
## BART 0.8863910 0.9060813
## XGB  0.9886118 0.8927250
## RF   0.9886008 0.9155704
## CART 0.7840297 0.7389020

Compare this to the in- and out-of-sample predictions from the models with just our selected few variables in them:

cors
##             In       Out
## BART 0.8863910 0.8033032
## XGB  0.9907074 0.8570469
## RF   0.9691669 0.8470609
## CART 0.7166293 0.6759275

What we get is some additional out-of-sample predictive capacity when we cast our net a bit wider.
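
Finally, recall that the hold data were set aside at the start and never touched. Whenever we settle on a final model, the last step would be to score it once against that holdout, for example (a sketch using the all-variable random forest; it assumes hold has no missing values in the predictors):

rf.hold <- predict(rf.all, newdata=hold)
cor(rf.hold, hold$fariss_repress)^2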