Preliminaries

R By Example is intended to walk users through transitioning to R from other software. In particular, it supports the course of the same name that I have taught through the ICPSR Summer Program since 2015.

License

This book, in its entirety, is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The code contained in this book is also licensed under the MIT license, under which you are permitted to use it in your own packages provided you cite the source.

In the ensuing pages, we will walk through lots of statistical models with only the slightest attention paid to the underlying statistical theory. As such, this is not so much a book (or a course) about statistics as it is a book about learning how to run and evaluate models you already know in R.

To follow along in the book, you should have the most recent versions of R and RStudio.

Rather than slides, I have decided to distribute this book, which has more prose than slides would permit. The idea is to provide something that will serve as a slightly more comprehensive reference as you start to employ R in your own analyses. There is an ever-increasing number of R books out there. The ones I particularly like are:

Getting R and RStudio

R is an object-oriented statistical programming environment. It remains largely command-line driven. There are a couple of attempts at generating point-and-click GUIs for R, but these are almost necessarily limited in scope and tend to be geared toward undergraduate research methods students - both RCommander and Jamovi are good examples. R is open-source (i.e., free) and downloadable from CRAN. Click the link for your operating system. In Windows, click on the link for base and then the link for “Download R for Windows.” Once the download is done, double-click on the resulting file and the installer will guide you through the installation process. There are some decisions to be made, but if you’re unsure, following the defaults is generally not a bad idea. We will not interact with R through its own GUI, but through RStudio, so the particularities of the setup will be unimportant in all but the most unusual cases. For Mac users, click on the link for “Download R for Mac” on the CRAN home page and then click the “R-.pkg” link. This book is compiled with R version 4.1.0.

You should also download RStudio, an Integrated Development Environment (IDE) for R. This application sits on top of your existing R installation (i.e., it also requires you to install R separately) to provide some nice text editing functions along with some other nice features. I’ve spent a considerable amount of time using other competing IDEs - WinEDT (a long time ago), TextMate, Atom, Sublime and Microsoft’s VS Code. There were things about all of them that I really liked, but ultimately for someone whose workflow is almost entirely in R, Markdown and LaTeX, there is little reason to move to something else. RStudio is also increasingly becoming a suitable IDE for other languages, too, e.g., Python.

In RStudio, you can change themes (color schemes), fonts and other aspects of the appearance. You can also use (and optionally set) shortcut keys for your own favorite operations, too. Some that I use regularly are:

  • ctrl + p finds the matching bracket
  • ctrl + shift + e expands the selection to the matching bracket
  • I also mapped the “quick add next” operation, which allows you to highlight a single instance of a string and then highlight the next one with a single keystroke. Then, you have multiple cursors, one at each instance, that you can use to make multiple changes at once.

Keeping Track of Your Work

In general, I’m a big fan of using RMarkdown documents to keep track of your work. They provide a great format for including both prose (that can explain to your colleagues, readers and even your future self what you did) and R code. They could also then form the basis for a paper or book you could write using RStudio.

I would encourage you to write in RMarkdown files that parallel the chapters of the book, so you can keep track of what you are doing to complete the exercises.

Using R

Like Stata and SAS, R has an active user-developer community. This is attractive as the types of models and situations R can deal with are always expanding. Unlike in Stata, in R you have to load the packages you need before you’re able to access the commands within those packages. All openly distributed packages are available from the Comprehensive R Archive Network (CRAN), though some of them come with the base version of R. To see what packages you have available, type library() or click on the “Packages” tab in the files panel. There are two related functions that you will need to obtain new packages for R.

  • install.packages() will download the relevant package from CRAN and install it on your machine. This step only has to be done once until you upgrade to a new minor (or major) version of R. For example, if you upgrade from 3.5.0 to 3.5.1, all of the packages you downloaded will still be available. In this step, a dialog box may ask you to choose a CRAN mirror - this is one of many sites that maintain complete archives of all of R’s user-developed packages. Usually, the advice is to pick one close to you (or the cloud option).

  • library() will make the commands in the packages you downloaded available to you in the current R session (a new session starts each time R is started and continues until that instance of R is terminated). As suggested, this has to be done (when you want to use functions other than those loaded automatically) each time you start R. There is an option to have R load packages automatically on startup by modifying the .Rprofile file (more on that later).

You can accomplish the same thing through the “packages” tab in the files panel. Each package that has been installed has a checkbox next to it. You can load the package by checking the checkbox. There is a search bar to help you locate your package without endless scrolling. There is also an “install” button in the upper right-hand corner of the packages tab. Clicking that will open an install packages dialog where you can type in the name of the package you want to install.
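A minimal sketch of the install-once, load-every-session pattern described above. The helper name needs_install() is hypothetical (it is not part of R or of any package); it only reports which packages are not yet installed, and the actual install and load calls are left as comments so that nothing is downloaded unless you choose to run them.

```r
# Hypothetical helper: which of these packages are not yet installed?
needs_install <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages that ship with R are always installed, so nothing is returned
needs_install(c("stats", "utils"))

# For the packages used in this chapter, you could run the following.
# Installing is done once; loading is done in every new R session.
# to_get <- needs_install(c("rio", "skimr"))
# if (length(to_get) > 0) install.packages(to_get)
# library(rio)
```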

The “object-oriented” nature of R means that you’re generally saving the results of commands into objects that you can access whenever you want and manipulate with other commands. R is a case-sensitive environment, so be careful how you name and access objects in the workspace and be careful how you call functions: lm() \(\neq\) LM().

There are a few tips that don’t really belong anywhere, but are nonetheless important, so I’ll just mention them here and you can refer back when they become relevant.

  • In RStudio, if you position your cursor in a line you want to execute (or highlight a block of text you want to execute), then hit ctrl+enter on a PC or command+enter on a Mac, the code will be executed automatically.
  • You can return to the command you previously entered in the R console by hitting the “up arrow” (similar to “Page Up” in Stata).
  • You can find out what directory R is in by typing getwd().
  • You can set the working directory of R by typing setwd(path) where path is the full path to the directory you want to use. The directories must be separated by forward slashes / and the entire string must be in quotes (either double or single). For example: setwd("C:/users/david/desktop"). You can also do this through the “Session” dropdown menu where you can select “Set Working Directory” as an option.
  • To see the values in any object, just type that object’s name into the command window and hit enter (or look in the object browser in RStudio).
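The working-directory tips can be tried safely with a short sketch like the one below. It switches to R's temporary directory (tempdir() is used here only as a stand-in for a real project folder) and then switches back, so your session ends up where it started.

```r
# Remember where R is currently working
old_dir <- getwd()
old_dir

# Move to another directory; tempdir() stands in for your project folder
setwd(tempdir())
getwd()

# Return to the original directory
setwd(old_dir)
getwd()
```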

Assigning Output to Objects

In this section, we will spend a bit of time on the very basics of coding in R. Some of this may seem tedious, but this is a good way of getting to understand how R works and we won’t spend too long here.

R can be used as a big calculator. By typing 2+2 into R, you will get the following output:

2+2
## [1] 4

After my input of 2+2, R has provided the output of 4, the evaluation of that mathematical expression. R just prints this output to the console. Doing it this way, the output is not saved per se. Notice that, unlike Stata, you do not have to ask R to “display” anything; in Stata, you would have to type display 2+2 to get the same result. Often, we want to save the output so we can look at it later. The assignment character in R is <- (the less-than sign directly followed by the minus sign). You may hear me say “X gets 10”; in R, this would translate to:

X <- 10
X
## [1] 10

You can also use = as the assignment character. When I started using R, people weren’t doing this, so I haven’t changed over yet, but the following is an equivalent way of specifying the above statement:

X = 10
X
## [1] 10

As with any convention that doesn’t matter much, there are dogmatic adherents on either side of the debate. Some argue that the code is easier to read using the arrow. Others argue that using a single keystroke to produce the assignment character is more efficient. In truth, both are probably right. Your choice is really a matter of taste.

To assign the output of an evaluated function to an object, just put the object on the left-hand side of the arrow and the function on the right-hand side.

X <- 4+4

Now the object X contains the evaluation of the expression 4+4 or 8. We see the contents of X simply by typing its name at the command prompt and hitting enter. In the above command, we’re assigning the output (or result) of the command 4+4 to X.

X
## [1] 8

There are a few things worth noting here:

  1. The object does not always save the call - the code that produced the output. Models tend to save this kind of information, but simpler mathematical calculations do not.
  2. There is no “undo” button and no warning that something is about to be overwritten. The onus is on you to make sure that you keep track of the names of objects you want to keep. Usually, this is not problematic: if you are keeping good track of what you’re doing, you’ll be able to re-generate any previous result easily.
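Both points can be seen in a small sketch. The model below uses the cars dataset that ships with R purely as a convenient illustration; it is not one of the datasets used in this book.

```r
# A simple calculation stores only its value, not the code that made it
X <- 4 + 4
X

# Re-using the name silently replaces the old contents; no warning is given
X <- "overwritten"
X

# Model objects, by contrast, usually keep the call that produced them
fit <- lm(dist ~ speed, data = cars)
fit$call
```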

Vectors and Matrices

We can make a vector of numbers by combining a bunch of numbers together using the c() function. This collects the numbers together in a single object.

Y <- c(1,2,3,4)
Y
## [1] 1 2 3 4

We can also do math on vectors. For example, we can add a scalar and it will add the scalar to each element of the vector:

Y + 3
## [1] 4 5 6 7
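The same element-by-element logic applies to other operators and to operations on two vectors of the same length. A short sketch, where Z is a second vector made up for illustration:

```r
Y <- c(1, 2, 3, 4)
Z <- c(10, 20, 30, 40)

# Multiplying by a scalar works on each element
Y * 2

# Adding two vectors of the same length pairs up their elements
Y + Z

# Functions such as mean() summarize the whole vector
mean(Y)
```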

The matrix() function allows us to make matrices.

mat1 <- matrix(c(1,2,3,4), nrow=2, ncol=2)
mat1
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

The default orientation of matrices is in column-major format, meaning that each column of the matrix is filled in until all of the numbers in the vector have been used. You could also fill in by rows by using the byrow=TRUE argument to the matrix() function.

mat2 <- matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)
mat2
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
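One way to convince yourself of the difference between the two fill orders is to note that, for a square matrix like this one, filling by row gives the transpose of filling by column. The sketch below also shows how to pull out a single element with [row, column] indexing.

```r
mat1 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
mat2 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

# t() transposes a matrix; here it turns mat1 into mat2
identical(t(mat1), mat2)

# Element in row 1, column 2 of each matrix
mat1[1, 2]
mat2[1, 2]

# dim() reports the number of rows and columns
dim(mat1)
```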

Function Arguments

As we will see throughout this book, each R function can take a number of arguments. These arguments and what they do are detailed in the function’s help file. You can access a function’s help file by typing help(matrix) or ?matrix; or, you can click on the “Help” tab in the files pane on the left. You can then use the search bar in the upper-right hand corner of the panel to search for a function’s help file.

Arguments can take lots of forms, but the most common are:

  • formula - a model specification that takes the form: outcome ~ covariate1 + covariate2 for additive functions and outcome ~ covariate1*covariate2 for conditional or multiplicative functions.
  • logical - either TRUE or FALSE (must be in all caps). Incidentally, TRUE has a numerical value of 1 and FALSE has a numerical value of 0.
  • string - A character string, must be in quotes (single or double quotes are fine, so long as they match).
  • data - Generally a data frame or something that can be coerced to a data frame.
  • vector or list - allow multiple values to be passed to a single argument.

For example, looking at the help file for the matrix() function, we see that the first argument data is a vector, nrow and ncol are scalars (single numbers) and byrow is a logical value.
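A brief sketch of how arguments are matched may help here. Arguments can be supplied by position (in the order given in the help file) or by name (in any order); both calls below build the same matrix.

```r
# Arguments matched by position: data, nrow, ncol
a <- matrix(c(1, 2, 3, 4), 2, 2)

# Arguments matched by name, so the order does not matter
b <- matrix(ncol = 2, nrow = 2, data = c(1, 2, 3, 4))
identical(a, b)

# args() prints a function's arguments and their default values
args(matrix)

# Logical values behave like 1 and 0 in arithmetic
TRUE + TRUE
```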

Importing Data

Before we move on to more complicated operations and more intricacies of dealing with data, the one thing everyone wants to know is - “How do I get my data into R?” As it turns out, the answer is - “quite easily.” There are a number of packages in R that can help read in from and write out to other statistical environments. I personally like the rio package. This package is actually a wrapper to lots of different packages for importing and exporting data in other formats. It automatically identifies the data type from its extension and uses the correct importer for the data.
Let’s look at a couple of examples.

The dataset we’ll be using here has three variables - x1 (a numeric variable), x2 (a labeled numeric variable [0=none, 1=some]) and x3 (a string variable with values “no” and “yes”). I’ve called this dataset r_example.sav (SPSS) and r_example.dta (Stata).

R has lots of different data structures available (e.g., arrays, lists, etc.). The one that we are going to be concerned with right now is the data frame, the R terminology for a dataset. A data frame can have different types of variables in it (e.g., character and numeric). It is rectangular (i.e., all rows have the same number of columns and all columns have the same number of rows). There are some more distinctions that make the data frame special, but we’ll talk about those later.

## load rio package - only need to do this once per R session
library(rio)
## load data, note the relative path to the dataset from the current directory
spss.dat <- import("data/r_example.sav")
## print the contents
spss.dat
##    x1 x2  x3
## 1   1  0 yes
## 2   2  0  no
## 3   3  1  no
## 4   4  0 yes
## 5   3  0  no
## 6   4  0 yes
## 7   1  1 yes
## 8   2  1 yes
## 9   5  1  no
## 10  6  0  no

There is also an example Stata dataset, which you could read in as follows:

stata.dat <- import("data/r_example.dta")
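If you do not have the example files to hand, a round trip through a temporary file shows the same import/export pattern. This is only a sketch: the small data frame and the file name are made up for illustration, and the code falls back to base R's write.csv() and read.csv() when rio is not installed.

```r
# A small made-up data frame with the same variable names as the example
dat <- data.frame(x1 = c(1, 2, 3),
                  x2 = c(0, 0, 1),
                  x3 = c("yes", "no", "no"))

# Write to a temporary CSV file (hypothetical file name)
csv_file <- file.path(tempdir(), "r_example_copy.csv")

if (requireNamespace("rio", quietly = TRUE)) {
  # rio picks the right writer and reader from the .csv extension
  rio::export(dat, csv_file)
  dat_back <- rio::import(csv_file)
} else {
  # Base R fallback when rio is unavailable
  write.csv(dat, csv_file, row.names = FALSE)
  dat_back <- read.csv(csv_file)
}

str(dat_back)
```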

Data Types in R

This is a convenient time to talk about different types of data in R. There are basically three different types of variables - numeric variables, factors and character strings.

  • Numeric variables would be something like GDP/capita, age or income (in $). Generally, these variables do not contain labels because they have many unique values. Dummy variables are also numeric with values 0 and 1. R will only do mathematical operations on numeric variables (e.g., mean, variance, etc…).
  • Factors are variables like social class or party for which you voted. When you think about how to include variables in a model, factors are variables that you would include by making a set of category dummy variables. Factors in R look like numeric variables with value labels in either Stata or SPSS. That is to say that there is a numbering scheme where each unique label value gets a unique number (all non-labeled values are coded as missing). Unlike in those other programs, R will not let you perform mathematical operations on factors.
  • Character strings are simply text. There is no numbering scheme with corresponding labels; the value in each cell is simply that cell’s text, not a number with a corresponding label like in a factor.
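The three types can be seen side by side in a short sketch. The variable names and values below are made up for illustration.

```r
income <- c(42000, 58500, 61250)              # numeric
party  <- factor(c("left", "right", "left"))  # factor
name   <- c("Ann", "Bob", "Cy")               # character string

class(income)
class(party)
class(name)

# A factor stores a number for each observation plus a set of labels (levels)
levels(party)
as.integer(party)

# Mathematical operations are meant for numeric variables only
mean(income)
is.numeric(party)
```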

The rio package reads numeric variables with labels as numbers, but it attaches an attribute to the variable called labels which can be used to turn the variable into a factor. Note the difference in the output below between x2 and x3 - x2 is numeric with a labels attribute and x3 is a character string (denoted with chr).

str(spss.dat)
## 'data.frame':    10 obs. of  3 variables:
##  $ x1: num  1 2 3 4 3 4 1 2 5 6
##   ..- attr(*, "label")= chr "x1"
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ x2: num  0 0 1 0 0 0 1 1 1 0
##   ..- attr(*, "label")= chr "x2"
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "labels")= Named num [1:2] 0 1
##   .. ..- attr(*, "names")= chr [1:2] "none" "some"
##  $ x3: chr  "yes" "no" "no" "yes" ...
##   ..- attr(*, "label")= chr "x3"
##   ..- attr(*, "format.spss")= chr "A3"
##   ..- attr(*, "display_width")= int 11

To turn x2 into a factor, we could use the function factorize() that’s in the rio package.

spss.dat$x2_fac <- factorize(spss.dat$x2)
str(spss.dat)
## 'data.frame':    10 obs. of  4 variables:
##  $ x1    : num  1 2 3 4 3 4 1 2 5 6
##   ..- attr(*, "label")= chr "x1"
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ x2    : num  0 0 1 0 0 0 1 1 1 0
##   ..- attr(*, "label")= chr "x2"
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "labels")= Named num [1:2] 0 1
##   .. ..- attr(*, "names")= chr [1:2] "none" "some"
##  $ x3    : chr  "yes" "no" "no" "yes" ...
##   ..- attr(*, "label")= chr "x3"
##   ..- attr(*, "format.spss")= chr "A3"
##   ..- attr(*, "display_width")= int 11
##  $ x2_fac: Factor w/ 2 levels "none","some": 1 1 2 1 1 1 2 2 2 1
##   ..- attr(*, "label")= chr "x2"

Examining Data

There are a few different methods for examining the properties of your data. The first tells you what type of data are in your data frame and gives a sense of what some representative values are. The str() function shows the structure of your dataset along with any attributes of the variables that would be otherwise hidden from view.

str(spss.dat)
## 'data.frame':    10 obs. of  4 variables:
##  $ x1    : num  1 2 3 4 3 4 1 2 5 6
##   ..- attr(*, "label")= chr "x1"
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ x2    : num  0 0 1 0 0 0 1 1 1 0
##   ..- attr(*, "label")= chr "x2"
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "labels")= Named num [1:2] 0 1
##   .. ..- attr(*, "names")= chr [1:2] "none" "some"
##  $ x3    : chr  "yes" "no" "no" "yes" ...
##   ..- attr(*, "label")= chr "x3"
##   ..- attr(*, "format.spss")= chr "A3"
##   ..- attr(*, "display_width")= int 11
##  $ x2_fac: Factor w/ 2 levels "none","some": 1 1 2 1 1 1 2 2 2 1
##   ..- attr(*, "label")= chr "x2"

The second method is a more substantive summary. The skimr package has a function called skim() that provides a nice summary depending on the variable type. Below I use skim_tee(), which prints the same summary and returns the original data unchanged.

library(skimr)
skim_tee(spss.dat)
## ── Data Summary ────────────────────────
##                            Values
## Name                       data  
## Number of rows             10    
## Number of columns          4     
## _______________________          
## Column type frequency:           
##   character                1     
##   factor                   1     
##   numeric                  2     
## ________________________         
## Group variables            None  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 x3                    0             1     2     3     0        2          0
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts    
## 1 x2_fac                0             1 FALSE          2 non: 6, som: 4
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
## 1 x1                    0             1   3.1 1.66      1     2     3     4     6 ▇▃▃▂▂
## 2 x2                    0             1   0.4 0.516     0     0     0     1     1 ▇▁▁▁▅
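If skimr is not installed, base R's summary() gives a plainer overview that also adapts to the variable type. The sketch below rebuilds two of the columns by hand (values copied from the printed dataset above) so that it runs on its own.

```r
# Values copied from the example dataset printed earlier
dat <- data.frame(
  x1 = c(1, 2, 3, 4, 3, 4, 1, 2, 5, 6),
  x2_fac = factor(c(0, 0, 1, 0, 0, 0, 1, 1, 1, 0),
                  levels = c(0, 1), labels = c("none", "some"))
)

# Quartiles and mean for numeric columns, counts for factors
summary(dat)

# Individual pieces can be computed directly
mean(dat$x1)
table(dat$x2_fac)
```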

Missing Values

In R, missing data are indicated with NA (similar to the ., or .a, .b, etc., in Stata). The dataset r_example_miss.dta looks like this in Stata:

. list

     +-----------------+
     | x1     x2    x3 |
     |-----------------|
  1. |  1   none   yes |
  2. |  2   none    no |
  3. |  .   some    no |
  4. |  4      .   yes |
  5. |  3   none    no |
     |-----------------|
  6. |  4   none   yes |
  7. |  1   some   yes |
  8. |  2   some   yes |
  9. |  5   some    no |
 10. |  6   none    no |
     +-----------------+

Notice that values are missing on x1 and x2. Let’s read the data into R and see what happens.

stata2.dat <- import("data/r_example_miss.dta")
stata2.dat$x2_fac <- factorize(stata2.dat$x2)
stata2.dat
##    x1 x2  x3 x2_fac
## 1   1  0 yes   none
## 2   2  0  no   none
## 3  NA  1  no   some
## 4   4 NA yes   <NA>
## 5   3  0  no   none
## 6   4  0 yes   none
## 7   1  1 yes   some
## 8   2  1 yes   some
## 9   5  1  no   some
## 10  6  0  no   none

Notice that the missing elements are NA.

There are a few different methods for dealing with missing values in models. Though they produce the same statistical result, they have different post-estimation behavior. These are specified through the na.action argument to modeling commands, and you can see how they work in the help file: ?na.action. In lots of the things we do, we will have to give the argument na.rm=TRUE to remove the missing data from the calculation (i.e., listwise delete).
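A small sketch of both ideas. The x1 values match the dataset above; the outcome y is made up purely so that a model can be fit. Two of R's built-in na.action choices, na.omit and na.exclude, give the same coefficients but differ in what residuals() returns afterward.

```r
x1 <- c(1, 2, NA, 4, 3, 4, 1, 2, 5, 6)
y  <- c(2, 4, 5, 8, 7, 9, 1, 3, 11, 12)  # hypothetical outcome

# Finding and counting missing values
is.na(x1)
sum(is.na(x1))

# Summary functions return NA unless told to drop missing values
mean(x1)
mean(x1, na.rm = TRUE)

# Same estimates, different post-estimation behavior
fit_omit <- lm(y ~ x1, na.action = na.omit)
fit_excl <- lm(y ~ x1, na.action = na.exclude)
length(residuals(fit_omit))  # only the complete cases
length(residuals(fit_excl))  # padded with NA to the original length
```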

Filtering with Logical Expressions and Sorting

A logical expression is one that evaluates to either TRUE (the condition is met) or FALSE (the condition is not met). There are a few operators you need to know (which are the same as the operators in Stata or SPSS).

  • EQUALITY == (two equal signs) is the symbol for logical equality. A == B evaluates to TRUE if A is equivalent to B and evaluates to FALSE otherwise.
  • INEQUALITY != is the command for inequality. A != B evaluates to TRUE when A is not equivalent to B.
  • AND & is the conjunction operator. A & B would evaluate to TRUE if both A and B were met. It would evaluate to FALSE if either A and/or B were not met.
  • OR | (the pipe character) is the logical or operator. A | B would evaluate to TRUE if either A and/or B is met and would evaluate to FALSE only if neither A nor B were met.
  • NOT ! (the exclamation point) is the character for logical negation. !(A & B) is the mirror image of (A & B) such that the latter evaluates to TRUE when the former evaluates to FALSE.
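The operators above work element by element on logical vectors. In this sketch, A and B are made-up vectors that cover all four combinations of TRUE and FALSE.

```r
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

A & B      # TRUE only where both are TRUE
A | B      # TRUE where at least one is TRUE
!(A & B)   # the negation of A & B
A == B     # TRUE where the two agree
A != B     # TRUE where the two differ
```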

When using these with variables, the conditions for factors and character strings should be specified with characters. With numeric variables, the conditions should be specified using numbers. A few examples will help to illuminate things here.

stata.dat$x3 == "yes"
##  [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE
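## x2_fac has to be created in stata.dat first; without it, the comparison returns logical(0)
stata.dat$x2_fac <- factorize(stata.dat$x2)
stata.dat$x2_fac == "none"
##  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE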
stata.dat$x2 == 1
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
stata.dat$x1 == 2
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE