R: Learning by Example
2021-07-19
Preliminaries
R By Example is intended to walk users through transitioning to R from other software. In particular, it supports the course of the same name that I have taught through the ICPSR Summer Program since 2015.
License
This book, in its entirety, is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The code contained in this book is also licensed under the MIT license, under which you are permitted to use it in your own packages provided you cite the source.
In the ensuing pages, we will walk through lots of statistical models with only the slightest attention paid to the underlying statistical theory. As such, this is not so much a book (or a course) about statistics as it is a book about learning how to run and evaluate models you already know in R.
To follow along in the book, you should have the most recent versions of R and RStudio.
Rather than slides, I have decided to distribute this book, which has more prose than slides would permit. The idea is to provide something that will serve as a slightly more comprehensive reference as you start to employ R in your own analyses. There is an ever-increasing number of R books out there. The ones I particularly like are:
- John Fox and Sanford Weisberg’s An R Companion to Applied Regression - the first edition of this book is how I learned R in 2002.
- Robert Kabacoff’s R in Action - this is a good “from first principles” book about R that I routinely recommend to people.
- Chester Ismay and Albert Kim’s Statistical Inference via Data Science - has some good introductory chapters that would be particularly useful for this audience.
Getting R and RStudio
R is an object-oriented statistical programming environment. It remains largely command-line driven. There are a couple of attempts at generating point-and-click GUIs for R, but these are almost necessarily limited in scope and tend to be geared toward undergraduate research methods students - both RCommander and Jamovi are good examples. R is open-source (i.e., free) and downloadable from CRAN. Click the link for your operating system. In Windows, click on the link for base
and then the link for “Download R
You should also download RStudio, an Integrated Development Environment (IDE) for R. This application sits on top of your existing R installation (i.e., it also requires you to install R separately) to provide some nice text editing functions along with some other nice features. I’ve spent a considerable amount of time using other competing IDEs - WinEDT (a long time ago), TextMate, Atom, Sublime and Microsoft’s VS Code. There were things about all of them that I really liked, but ultimately for someone whose workflow is almost entirely in R, Markdown and LaTeX, there is little reason to move to something else. RStudio is also increasingly becoming a suitable IDE for other languages, too, e.g., Python.
In RStudio, you can change themes (color schemes), fonts and other aspects of the appearance. You can also use (and optionally set) shortcut keys for your own favorite operations, too. Some that I use regularly are:
- ctrl + p finds the matching bracket
- ctrl + shift + e expands the selection to the matching bracket
- I also mapped the “quick add next” operation, which allows you to highlight a single instance of a string and then highlight the next one with a single keystroke. Then, you have multiple cursors at each instance that you can use to make multiple changes at once.
Keeping Track of Your Work
In general, I’m a big fan of using RMarkdown documents to keep track of your work. They provide a great format for including both prose (that can explain to your colleagues, readers and even your future self what you did) and R code. They could also form the basis for a paper or book you write using RStudio.
I would encourage you to write in RMarkdown files that parallel the chapters of the book, so you can keep track of what you are doing to complete the exercises.
Using R
Like Stata and SAS, R has an active user-developer community. This is attractive as the types of models and situations R can deal with are always expanding. Unlike Stata, in R, you have to load the packages you need before you’re able to access the commands within those packages. All openly distributed packages are available from the Comprehensive R Archive Network (CRAN), though some of them come with the base version of R. To see what packages you have available, type library() or click on the “Packages” tab in the files panel. There are two related functions that you will need to obtain new packages for R.
- install.packages() will download the relevant package from CRAN and install it on your machine. This step only has to be done once until you upgrade to a new minor (or major) version of R. For example, if you upgrade from 3.5.0 to 3.5.1, all of the packages you downloaded will still be available. In this step, a dialog box will ask you to choose a CRAN mirror - this is one of many sites that maintain complete archives of all of R’s user-developed packages. Usually, the advice is to pick one close to you (or the cloud option).
- library() will make the commands in the packages you downloaded available to you in the current R session (a new session starts each time R is started and continues until that instance of R is terminated). As suggested, this has to be done (when you want to use functions other than those loaded automatically) each time you start R. There is an option to have R load packages automatically on startup by modifying the .Rprofile file (more on that later).
You can accomplish the same thing through the “packages” tab in the files panel. Each package that has been installed has a checkbox next to it. You can load the package by checking the checkbox. There is a search bar to help you locate your package without endless scrolling. There is also an “install” button in the upper right-hand corner of the packages tab. Clicking that will open an install packages dialog where you can type in the name of the package you want to install.
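The install-once, load-every-session pattern described above can be sketched as follows. The helper name load_pkg() is hypothetical (it is not part of R or of any package), and the example loads tools, a package that ships with R, so that nothing has to be downloaded.

```r
## Hypothetical helper: install a package only if it is missing, then load it
load_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # one-time download from CRAN
  }
  library(pkg, character.only = TRUE)  # has to be repeated in every new session
}

## tools ships with R, so this call only loads it
load_pkg("tools")

## confirm that the package is now attached to the search path
"package:tools" %in% search()
```

Replacing "tools" with "rio" would install rio on first use (this needs an internet connection) and simply load it in later sessions.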
The “object-oriented” nature of R means that you’re generally saving the results of commands into objects that you can access whenever you want and manipulate with other commands. R is a case-sensitive environment, so be careful how you name and access objects in the workspace and be careful how you call functions: lm() \(\neq\) LM().
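A quick way to see case sensitivity at work is sketched below. The object names x and X are made up for illustration, and the check on LM assumes that no loaded package happens to define a function with that name.

```r
## Object names are case sensitive: x and X are two different objects
x <- 1
X <- 2
x == X    # FALSE, they hold different values

## Function names are case sensitive too
lm_found <- exists("lm", mode = "function")   # lm() is part of the stats package
LM_found <- exists("LM", mode = "function")   # no LM() in a default installation
c(lm = lm_found, LM = LM_found)
```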
There are a few tips that don’t really belong anywhere, but are nonetheless important, so I’ll just mention them here and you can refer back when they become relevant.
- In RStudio, if you position your cursor in a line you want to execute (or highlight a block of text you want to execute), then hit ctrl+enter on a PC or command+enter on the Mac, the code will be automatically executed.
- You can return to the command you previously entered in the R console by hitting the “up arrow” (similar to “Page Up” in Stata).
- You can find out what directory R is in by typing getwd().
- You can set the working directory of R by typing setwd(path) where path is the full path to the directory you want to use. The directories must be separated by forward slashes / and the entire string must be in quotes (either double or single). For example: setwd("C:/users/david/desktop"). You can also do this through the “Session” dropdown menu where you can select “Set Working Directory” as an option.
- To see the values in any object, just type that object’s name into the command window and hit enter (or look in the object browser in RStudio).
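As a small sketch of the two working-directory functions, the code below moves to a different directory and back again. It uses tempdir() only because that directory exists on every machine; in practice you would supply the path to your own project folder.

```r
## remember where R currently is
old_dir <- getwd()

## move to another directory (a temporary one, just for illustration)
setwd(tempdir())
new_dir <- getwd()
new_dir

## go back to where we started
setwd(old_dir)
getwd()
```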
Assigning Output to Objects
In this section, we will spend a bit of time on the very basics of coding in R. Some of this may seem tedious, but this is a good way of getting to understand how R works and we won’t spend too long here.
R can be used as a big calculator. By typing 2+2
into R, you will get the following output:
2+2
## [1] 4
After my input of 2+2, R has provided the output of 4, the evaluation of that mathematical expression. R just prints this output to the console. Doing it this way, the output is not saved per se. Notice that, unlike Stata, you do not have to ask R to “display” anything; in Stata, you would have to type display 2+2 to get the same result.
Oftentimes, we want to save the output so we can look at it later. The assignment character in R is <- (the less-than sign directly followed by the minus sign). You may hear me say “X gets 10”; in R, this would translate to:
X <- 10
X
## [1] 10
You can also use = as the assignment character. When I started using R, people weren’t doing this, so I haven’t changed over yet, but the following is an equivalent way of specifying the above statement:
X = 10
X
## [1] 10
As with any convention that doesn’t matter much, there are dogmatic adherents on either side of the debate. Some argue that the code is easier to read using the arrow. Others argue that using a single keystroke to produce the assignment character is more efficient. In truth, both are probably right. Your choice is really a matter of taste.
To assign the output of an evaluated function to an object, just put the object on the left-hand side of the arrow and the function on the right-hand side.
X <- 4+4
Now the object X contains the evaluation of the expression 4+4, or 8. We see the contents of X simply by typing its name at the command prompt and hitting enter. In the above command, we’re assigning the output (or result) of the command 4+4 to X.
X
## [1] 8
There are a few things worth noting here:
- The object does not always save the call - the code that produced the output. Models tend to save this kind of information, but simpler mathematical calculations do not.
- There is no “undo” button and no warning that something is about to be overwritten. The onus is on you to make sure that you keep track of the names of objects you want to keep. Usually, this is not a problem: if you are keeping good track of what you’re doing, you’ll be able to re-generate any previous result easily.
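Both points can be seen in a short sketch. The model below is only an illustration; it uses cars, an example dataset that ships with R, rather than anything from this book.

```r
## A simple calculation stores only its result, not the code that produced it
X <- 4 + 4
X

## Assigning to the same name silently replaces the old value
X <- "overwritten"
X

## Model objects usually keep the call that produced them
mod <- lm(dist ~ speed, data = cars)
mod$call
```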
Vectors and Matrices
We can make a vector of numbers by combining a bunch of numbers together using the c() function. This collects the numbers together in a single object.
Y <- c(1,2,3,4)
Y
## [1] 1 2 3 4
We can also do math on vectors. For example, we can add a scalar and it will add the scalar to each element of the vector:
Y+3
## [1] 4 5 6 7
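Other arithmetic works the same way. The following sketch shows a few more element-wise operations along with functions that reduce a vector to a single number.

```r
Y <- c(1, 2, 3, 4)

Y * 2      # multiplies every element by 2
Y + Y      # adds the two vectors element by element
Y^2        # squares every element
sum(Y)     # adds up all of the elements
mean(Y)    # average of the elements
```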
The matrix() function allows us to make matrices.
mat1 <- matrix(c(1,2,3,4), nrow=2, ncol=2)
mat1
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
The default orientation of matrices is in column-major format, meaning that each column of the matrix is filled in until all of the numbers in the vector have been used. You could also fill in by rows by using the byrow=TRUE argument to the matrix() function.
mat2 <- matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)
mat2
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
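To see how the two fill orders relate, here is a brief sketch that re-creates both matrices and pulls out a few pieces using square brackets (row index first, column index second).

```r
mat1 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
mat2 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

dim(mat1)       # number of rows, then number of columns
mat1[1, 2]      # the element in row 1, column 2
mat1[, 1]       # the whole first column

## filling by row gives the transpose of filling by column
identical(t(mat1), mat2)
```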
Function Arguments
As we will see throughout this book, each R function can take a number of arguments. These arguments and what they do are detailed in the function’s help file. You can access a function’s help file by typing help(matrix) or ?matrix; or, you can click on the “Help” tab in the files pane on the left. You can then use the search bar in the upper-right hand corner of the panel to search for a function’s help file.
Arguments can take lots of forms, but the most common are:
- formula - a model specification that takes the form: outcome ~ covariate1 + covariate2 for additive functions and outcome ~ covariate1*covariate2 for conditional or multiplicative functions.
- logical - either TRUE or FALSE (must be in all caps). BTW, TRUE has a numerical value of 1 and FALSE has a numerical value of 0.
- string - A character string, must be in quotes (single or double quotes are fine, so long as they match).
- data - Generally a data frame or something that can be coerced to a data frame.
- vector or list - allow multiple values to be passed to a single argument.
For example, looking at the help file for the matrix() function, we see that the first argument data is a vector, nrow and ncol are scalars (single numbers) and byrow is a logical value.
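A short sketch of how arguments are supplied in practice: they can be matched by position or by name, and args() gives a quick reminder of what a function expects. The object names m_pos and m_named are made up for this illustration.

```r
## Arguments matched by position: data, nrow, ncol
m_pos <- matrix(c(1, 2, 3, 4), 2, 2)

## Arguments matched by name can be given in any order
m_named <- matrix(ncol = 2, nrow = 2, data = c(1, 2, 3, 4))
identical(m_pos, m_named)

## args() prints a function's arguments and their default values
args(matrix)

## Logical values behave like 1 and 0 in arithmetic
TRUE + TRUE
sum(c(TRUE, FALSE, TRUE))
```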
Importing Data
Before we move on to more complicated operations and more intricacies of dealing with data, the one thing everyone wants to know is - “How do I get my data into R?” As it turns out, the answer is - “quite easily.” There are a number of packages in R that can help read in from and write out to other statistical environments. I personally like the rio package. This package is actually a wrapper to lots of different packages for importing and exporting data in other formats. It automatically identifies the data type from its extension and uses the correct importer for the data.
Let’s look at a couple of examples.
The dataset we’ll be using here has three variables - x1 (a numeric variable), x2 (a labeled numeric variable [0=none, 1=some]) and x3 (a string variable with values “no” and “yes”). I’ve called this dataset r_example.sav (SPSS) and r_example.dta (Stata).
R has lots of different data structures available (e.g., arrays, lists, etc.). The one that we are going to be concerned with right now is the data frame, the R terminology for a dataset. A data frame can have different types of variables in it (e.g., character and numeric). It is rectangular (i.e., all rows have the same number of columns and all columns have the same number of rows). There are some more distinctions that make the data frame special, but we’ll talk about those later.
## load rio package - only need to do this once per R session
library(rio)
## load data, note the relative path to the dataset from the current directory
spss.dat <- import("data/r_example.sav")
## print the contents
spss.dat
## x1 x2 x3
## 1 1 0 yes
## 2 2 0 no
## 3 3 1 no
## 4 4 0 yes
## 5 3 0 no
## 6 4 0 yes
## 7 1 1 yes
## 8 2 1 yes
## 9 5 1 no
## 10 6 0 no
There is also an example Stata dataset, which you could read in as follows:
stata.dat <- import("data/r_example.dta")
Data Types in R
This is a convenient time to talk about different types of data in R. There are basically three different types of variables - numeric variables, factors and character strings.
- Numeric variables would be something like GDP/capita, age or income (in $). Generally, these variables do not contain labels because they have many unique values. Dummy variables are also numeric with values 0 and 1. R will only do mathematical operations on numeric variables (e.g., mean, variance, etc…).
- Factors are variables like social class or party for which you voted. When you think about how to include variables in a model, factors are variables that you would include by making a set of category dummy variables. Factors in R look like numeric variables with value labels in either Stata or SPSS. That is to say that there is a numbering scheme where each unique label value gets a unique number (all non-labeled values are coded as missing). Unlike in those other programs, R will not let you perform mathematical operations on factors.
- Character strings are simply text. There is no numbering scheme with corresponding labels; the value in each cell is simply that cell’s text, not a number with a corresponding label like in a factor.
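The three types can be seen side by side in a small sketch. The data frame below is entirely made up for illustration and is not one of the book's datasets.

```r
## A small made-up data frame with one variable of each type
dat <- data.frame(
  income = c(35000, 52000, 41000),
  class  = factor(c("working", "middle", "working")),
  name   = c("Ann", "Bob", "Cat"),
  stringsAsFactors = FALSE
)

sapply(dat, class)       # the type of each variable
mean(dat$income)         # works because income is numeric
levels(dat$class)        # the labels of the factor
as.integer(dat$class)    # the numbers stored behind those labels

## mean() refuses to do math on a factor: it warns and returns NA
fac_mean <- suppressWarnings(mean(dat$class))
fac_mean
```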
The rio package reads numeric variables with labels as numbers, but it attaches an attribute to the variable called labels which can be used to turn the variable into a factor. Note the difference in the output below between x2 and x3 - x2 is numeric with a labels attribute and x3 is a character string (denoted with chr).
str(spss.dat)
## 'data.frame': 10 obs. of 3 variables:
## $ x1: num 1 2 3 4 3 4 1 2 5 6
## ..- attr(*, "label")= chr "x1"
## ..- attr(*, "format.spss")= chr "F8.2"
## $ x2: num 0 0 1 0 0 0 1 1 1 0
## ..- attr(*, "label")= chr "x2"
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "labels")= Named num [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "none" "some"
## $ x3: chr "yes" "no" "no" "yes" ...
## ..- attr(*, "label")= chr "x3"
## ..- attr(*, "format.spss")= chr "A3"
## ..- attr(*, "display_width")= int 11
To turn x2 into a factor, we could use the function factorize() that’s in the rio package.
spss.dat$x2_fac <- factorize(spss.dat$x2)
str(spss.dat)
## 'data.frame': 10 obs. of 4 variables:
## $ x1 : num 1 2 3 4 3 4 1 2 5 6
## ..- attr(*, "label")= chr "x1"
## ..- attr(*, "format.spss")= chr "F8.2"
## $ x2 : num 0 0 1 0 0 0 1 1 1 0
## ..- attr(*, "label")= chr "x2"
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "labels")= Named num [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "none" "some"
## $ x3 : chr "yes" "no" "no" "yes" ...
## ..- attr(*, "label")= chr "x3"
## ..- attr(*, "format.spss")= chr "A3"
## ..- attr(*, "display_width")= int 11
## $ x2_fac: Factor w/ 2 levels "none","some": 1 1 2 1 1 1 2 2 2 1
## ..- attr(*, "label")= chr "x2"
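For comparison, the same kind of factor can be built by hand with the base R factor() function. This is only a sketch of the idea, not a description of what factorize() does internally; the values of x2 are typed in from the output above, and the labels are the ones shown in its labels attribute.

```r
## x2 as printed in the output above
x2 <- c(0, 0, 1, 0, 0, 0, 1, 1, 1, 0)

## map the values 0 and 1 to the labels "none" and "some"
x2_fac <- factor(x2, levels = c(0, 1), labels = c("none", "some"))
x2_fac

## frequency of each category
table(x2_fac)
```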
Examining Data
There are a few different methods for examining the properties of your data. The first will tell you what type of data are in your data frame and gives a sense of what some representative values are. The str() function shows the structure of your dataset along with any attributes of the variables that would be otherwise hidden from view.
str(spss.dat)
## 'data.frame': 10 obs. of 4 variables:
## $ x1 : num 1 2 3 4 3 4 1 2 5 6
## ..- attr(*, "label")= chr "x1"
## ..- attr(*, "format.spss")= chr "F8.2"
## $ x2 : num 0 0 1 0 0 0 1 1 1 0
## ..- attr(*, "label")= chr "x2"
## ..- attr(*, "format.spss")= chr "F8.2"
## ..- attr(*, "labels")= Named num [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "none" "some"
## $ x3 : chr "yes" "no" "no" "yes" ...
## ..- attr(*, "label")= chr "x3"
## ..- attr(*, "format.spss")= chr "A3"
## ..- attr(*, "display_width")= int 11
## $ x2_fac: Factor w/ 2 levels "none","some": 1 1 2 1 1 1 2 2 2 1
## ..- attr(*, "label")= chr "x2"
The second method is a more substantive summary. The skimr package has a function called skim() that provides a nice summary depending on the variable type. The code below uses skim_tee(), which prints the same summary.
library(skimr)
skim_tee(spss.dat)
## ── Data Summary ────────────────────────
## Values
## Name data
## Number of rows 10
## Number of columns 4
## _______________________
## Column type frequency:
## character 1
## factor 1
## numeric 2
## ________________________
## Group variables None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 x3 0 1 2 3 0 2 0
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 x2_fac 0 1 FALSE 2 non: 6, som: 4
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
## 1 x1 0 1 3.1 1.66 1 2 3 4 6 ▇▃▃▂▂
## 2 x2 0 1 0.4 0.516 0 0 0 1 1 ▇▁▁▁▅
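If skimr is not installed, base R offers a rougher version of the same information. The sketch below rebuilds the example data by hand from the values printed earlier (an assumption made so the code runs without the data files) and summarizes it with base functions.

```r
## Rebuild the example data by hand from the values printed earlier
dat <- data.frame(
  x1 = c(1, 2, 3, 4, 3, 4, 1, 2, 5, 6),
  x2 = c(0, 0, 1, 0, 0, 0, 1, 1, 1, 0),
  x3 = c("yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "no")
)

head(dat, 3)      # the first three rows
dim(dat)          # number of rows and columns
summary(dat)      # quartiles and mean for the numeric variables
table(dat$x3)     # frequency table for the character variable
```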
Missing Values
In R, missing data are indicated with NA (similar to the ., or .a, .b, etc., in Stata). The dataset r_example_miss.dta looks like this in Stata:
. list
+-----------------+
| x1 x2 x3 |
|-----------------|
1. | 1 none yes |
2. | 2 none no |
3. | . some no |
4. | 4 . yes |
5. | 3 none no |
|-----------------|
6. | 4 none yes |
7. | 1 some yes |
8. | 2 some yes |
9. | 5 some no |
10. | 6 none no |
+-----------------+
Notice that it looks like values are missing on all three variables. Let’s read the data into R and see what happens.
stata2.dat <- import("data/r_example_miss.dta")
stata2.dat$x2_fac <- factorize(stata2.dat$x2)
stata2.dat
## x1 x2 x3 x2_fac
## 1 1 0 yes none
## 2 2 0 no none
## 3 NA 1 no some
## 4 4 NA yes <NA>
## 5 3 0 no none
## 6 4 0 yes none
## 7 1 1 yes some
## 8 2 1 yes some
## 9 5 1 no some
## 10 6 0 no none
Notice that the missing elements are NA.
There are a few different methods for dealing with missing values. Though they produce the same statistical result, they have different post-estimation behavior. These are specified through the na.action argument to modeling commands and you can see how these work by using the help functions: ?na.action. In lots of the things we do, we will have to give the argument na.rm=TRUE to remove the missing data from the calculation (i.e., listwise delete).
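Both ideas can be sketched in a few lines. The vector x1 is typed in from the output above; the outcome y is made up purely so that a model can be fit. The two na.action settings give the same estimates, but na.exclude pads the residuals with NA so they line up with the original rows.

```r
## x1 as it appears in the data with missing values
x1 <- c(1, 2, NA, 4, 3, 4, 1, 2, 5, 6)

is.na(x1)                  # TRUE where the value is missing
mean(x1)                   # NA: one missing value makes the result missing
mean(x1, na.rm = TRUE)     # drop the NA before computing the mean

## Made-up outcome, used only to fit a model
y <- c(2, 3, 5, 7, 6, 8, 1, 4, 9, 12)

fit_omit    <- lm(y ~ x1, na.action = na.omit)
fit_exclude <- lm(y ~ x1, na.action = na.exclude)

all.equal(coef(fit_omit), coef(fit_exclude))   # same estimates
length(residuals(fit_omit))                    # the incomplete row is dropped
length(residuals(fit_exclude))                 # padded with NA for that row
```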
Filtering with Logical Expressions and Sorting
A logical expression is one that evaluates to either TRUE (the condition is met) or FALSE (the condition is not met). There are a few operators you need to know (which are the same as the operators in Stata or SPSS).
- EQUALITY == (two equal signs) is the symbol for logical equality. A == B evaluates to TRUE if A is equivalent to B and evaluates to FALSE otherwise.
- INEQUALITY != is the command for inequality. A != B evaluates to TRUE when A is not equivalent to B.
- AND & is the conjunction operator. A & B would evaluate to TRUE if both A and B were met. It would evaluate to FALSE if either A and/or B were not met.
- OR | (the pipe character) is the logical or operator. A | B would evaluate to TRUE if either A and/or B is met and would evaluate to FALSE only if neither A nor B were met.
- NOT ! (the exclamation point) is the character for logical negation. !(A & B) is the mirror image of (A & B) such that the latter evaluates to TRUE when the former evaluates to FALSE.
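The operators can be tried out on two small logical vectors that cover every combination of TRUE and FALSE. The names A and B simply follow the notation of the list above.

```r
A <- c(TRUE, TRUE, FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE, FALSE)

A & B       # TRUE only where both are TRUE
A | B       # TRUE where at least one is TRUE
!(A & B)    # flips every element of A & B
A == B      # TRUE where A and B have the same value
A != B      # TRUE where they differ
```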
When using these with variables, the conditions for factors and character strings should be specified with characters. With numeric variables, the conditions should be specified using numbers. A few examples will help to illuminate things here.
stata.dat$x3 == "yes"
## [1] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
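## x2_fac was created in spss.dat above, not in stata.dat
spss.dat$x2_fac == "none"
## [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE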
stata.dat$x2 == 1
## [1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
stata.dat$x1 == 2
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
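Logical vectors like these become useful when placed inside square brackets, where they keep only the rows for which the condition is TRUE. The sketch below rebuilds the example data by hand from the values printed earlier (an assumption made so it runs without the data files) and uses base R's order() for sorting.

```r
## Rebuild the example data by hand from the values printed earlier
dat <- data.frame(
  x1 = c(1, 2, 3, 4, 3, 4, 1, 2, 5, 6),
  x2 = c(0, 0, 1, 0, 0, 0, 1, 1, 1, 0),
  x3 = c("yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "no")
)

## keep the rows where x3 is "yes"
keep <- dat$x3 == "yes"
which(keep)               # row numbers that meet the condition
yes_rows <- dat[keep, ]
yes_rows

## combine two conditions with &
both <- dat[dat$x3 == "yes" & dat$x1 > 1, ]
both

## sort the rows from the smallest to the largest value of x1
sorted <- dat[order(dat$x1), ]
sorted
```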