Some people use R like Excel: they open RStudio, load their workspace, and pick up where they left off last time, running commands interactively in the console (perhaps pulling these from a .R script that’s used like a Word document to store useful commands). This may be you if:
You save your workspace image when you close RStudio
Your scripts are many hundreds or even thousands of lines long
Your data files have names like data_jan16, data_jan17, data_cleaned_jan17, etc.
You sometimes get different results when you re-run the same command (!)
R should actually be used like Python (or Julia, JavaScript, C, or
any other programming language): scripts are sets of instructions that
should be run in sequence in their entirety to achieve
a well-defined purpose (analysis_jan16.R is not a
well-defined purpose), and the console should be used to debug scripts.
Note that this is not a .R script. It’s an R notebook (.Rmd), which
contains text (like this) alongside code.
The goal of this workshop will be to show you how and why to use R like Python. We’ll specifically focus on preprocessing data, which often turns out to be 80% of the work of data analysis.
R (like other programming languages) has a few basic “atomic” data types. More complex pieces of data (like spreadsheets) are built up from these basic data types:
Booleans: TRUE and FALSE
Numbers: like 2 or 3.14
Strings: 'hello' or "goodbye"
There are others, but these are the main 3 you’ll encounter.
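If you’re ever unsure which type a value has, the class function will tell you:
class(TRUE)
## [1] "logical"
class(3.14)
## [1] "numeric"
class('hello')
## [1] "character"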
You can give a piece of data a name to refer back to it later—this is
called assigning it to a variable. The arrow operator <-
(which you can write using alt + hyphen) accomplishes this:
my_bool <- TRUE
print(my_bool)
## [1] TRUE
Atomic pieces of data can be combined into more complex “data structures”. The main 3 data structures you’ll interact with are atomic vectors, lists, and data frames.
Atomic vectors are sequences of atomic pieces of data. They’re
created with c(...) like so:
my_vector <- c(2, 4, 8, 16)
Note that all the elements in an atomic vector have to be of the same type. If you mix types, R will silently convert (“coerce”) everything to the most flexible type:
bools_and_nums <- c(TRUE, FALSE, 1, 2, 3)
print(bools_and_nums)
## [1] 1 0 1 2 3
and_strings <- c(TRUE, FALSE, 1, 2, 3, 'a', 'b', 'c')
print(and_strings)
## [1] "TRUE"  "FALSE" "1"     "2"     "3"     "a"     "b"     "c"
To index (access) specific elements within a vector, you can use square brackets containing the indices of the elements you want:
print(my_vector)
## [1]  2  4  8 16
my_vector[1] # Single element
## [1] 2
my_vector[2:4] # Sequence of elements
## [1]  4  8 16
my_vector[c(1, 3)] # Only certain elements
## [1] 2 8
my_vector[-c(1, 3)] # All but certain elements
## [1]  4 16
This is called “integer indexing”, but you can also do logical indexing. Here you create a vector of Booleans that’s TRUE for each value you want to access and FALSE for each value you want to ignore:
print(my_vector)
## [1]  2  4  8 16
idx <- my_vector < 5
print(idx)
## [1]  TRUE  TRUE FALSE FALSE
print(my_vector[idx])
## [1] 2 4
Lists are like vectors (they are also sequences of data) but less constrained: their contents don’t have to be atomic and don’t have to be all the same type. Usually in a list, each element is associated with a different name:
my_list <- list(name1 = 'data 1',
                name2 = c(1, 2, 3))
print(my_list)
## $name1
## [1] "data 1"
## 
## $name2
## [1] 1 2 3
Then, to access an element of a list by name, you can use a dollar sign:
print(my_list$name1)
## [1] "data 1"
print(my_list$name2)
## [1] 1 2 3
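You can also index a list with double square brackets, either by name or by position:
print(my_list[['name1']])
## [1] "data 1"
print(my_list[[2]])
## [1] 1 2 3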
Data frames are a very important data structure because they’re used to store data read from a spreadsheet. A data frame is like a list of atomic vectors: you access each column the way you’d access elements in a list, and each column is an atomic vector.
my_df <- read.csv('example-data.csv')
print(my_df)
##            rt choice cond
## 1   0.9110855    old    a
## 2   8.4787557    new    b
## 3   8.1044009    new    a
## 4  12.2404803    old    b
## 5   4.3951396    old    a
## 6  13.3365931    old    b
## 7  19.4296267    old    a
## 8   4.8182342    new    b
## 9   3.4131416    old    a
## 10  0.1668238    new    b
print(my_df$rt) # dollar sign
##  [1]  0.9110855  8.4787557  8.1044009 12.2404803  4.3951396 13.3365931
##  [7] 19.4296267  4.8182342  3.4131416  0.1668238
print(my_df$rt[2:4]) # dollar sign and square brackets
## [1]  8.478756  8.104401 12.240480
You can access sections of data frames using a special square brackets notation, where you give first the rows you want (using any index that would work for vectors) followed by the columns you want (as a vector of strings):
row_idx <- c(1, 3)
col_idx <- c('rt', 'cond')
sub_df <- my_df[row_idx, col_idx]
print(sub_df)
##          rt cond
## 1 0.9110855    a
## 3 8.1044009    a
To access all rows for a given subset of columns, just omit the row index (and vice versa) but keep the comma:
print(my_df[, col_idx]) 
##            rt cond
## 1   0.9110855    a
## 2   8.4787557    b
## 3   8.1044009    a
## 4  12.2404803    b
## 5   4.3951396    a
## 6  13.3365931    b
## 7  19.4296267    a
## 8   4.8182342    b
## 9   3.4131416    a
## 10  0.1668238    b
print(my_df[row_idx, ])
##          rt choice cond
## 1 0.9110855    old    a
## 3 8.1044009    new    a
Spreadsheets should be organized according to the tidy data standard, where each row corresponds to a different observation (at the spreadsheet’s level of observation) and each column corresponds to a different measured variable. Typical examples:
Each row is a different participant. There are columns for age, gender, etc.
Each row is a different trial from an experiment. There are columns for RT, response, condition, etc.
Each row is a different (within-subjects) experimental condition for a different participant. There are columns for things that are specific to the participant (e.g., age) and columns for things that are specific to the experimental condition for that participant (e.g., mean RT).
It’s important to be clear about what the unit/level of observation is for a given data frame: in the first example, individual participants are the unit of observation, whereas the second describes trial-level data. Understanding how your data is organized helps you think clearly about it (e.g., avoiding mistakes like the one described in this legendary thread).
In general, it’s best to store any spreadsheet that you might want to read with R as a .csv file rather than a .xlsx file. CSVs are simple text files where side-by-side cells are separated by commas and each row is on a different line. CSVs have 2 main advantages:
They can be opened with any software—you don’t need Excel, let alone a specific version of Excel, nor a special R library for reading Excel files
Excel won’t let you save CSVs with things like coloured cells or formulas or anything else that R won’t know how to read (Excel might give you scary warnings that some data will be lost if you use CSV format—be brave)
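Going the other way, you can save any data frame from R as a CSV with the base write.csv function (the file name below is just an example; row.names = FALSE stops R from writing an extra column of row numbers):
write.csv(my_df, 'my-data.csv', row.names = FALSE)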
To learn how to manipulate data frames, I’ve generated some example data for this workshop. Suppose we’ve run an intervention and collected performance on some memory task at 3 time points. We’ve also collected demographic info. Let’s read the files into R:
demog <- read.csv('demographics.csv')
task_pre <- read.csv('task-pre.csv')
task_mid <- read.csv('task-mid.csv')
task_post <- read.csv('task-post.csv')
As we saw earlier, we can select a subset of rows from a data frame
using square bracket notation. R also has a convenient
subset function that allows us to do this a little more
succinctly when we want to filter out rows that don’t match some
criterion:
# Method 1: square brackets
slow_rt_idx <- task_pre$mean_rt > 1
print(slow_rt_idx)
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
print(task_pre[slow_rt_idx, ])
##     id  mean_rt ppn_correct
## 2 P007 1.017769   0.6808997
## 7 P003 1.037498   0.5489067
## 8 P006 1.012592   0.5978999
# Method 2: subset function
print(subset(task_pre, mean_rt > 1))
##     id  mean_rt ppn_correct
## 2 P007 1.017769   0.6808997
## 7 P003 1.037498   0.5489067
## 8 P006 1.012592   0.5978999
Note that the subset function can only do logical
indexing. It isn’t used to, for example, access the first row of a data
frame.
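For that kind of positional access, square brackets are still the tool:
print(task_pre[1, ])
##     id   mean_rt ppn_correct
## 1 P004 0.9940302   0.8553045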
One way to combine data frames is to stack them on top of one another
using the rbind function. For example, suppose we want to
combine pre-, mid-, and post-session task data this way:
stacked <- rbind(task_pre, task_mid, task_post)
print(stacked)
##      id   mean_rt ppn_correct
## 1  P004 0.9940302  0.85530451
## 2  P007 1.0177688  0.68089972
## 3  P005 0.8747640  0.71882082
## 4  P002 0.9370050  0.33467732
## 5  P001 0.7810016  0.60323019
## 6  P008 0.8966036  0.49633942
## 7  P003 1.0374975  0.54890673
## 8  P006 1.0125920  0.59789987
## 9  P001 1.0808253  0.29404723
## 10 P007 0.9053562  0.06180917
## 11 P003 0.8050702  0.73247260
## 12 P004 1.1213398  0.98107790
## 13 P006 0.7467665  0.27127975
## 14 P008 0.9261200  0.77565487
## 15 P002 1.0827661  0.73563547
## 16 P005 0.8617424  0.26777398
## 17 P007 0.8357853  0.51299325
## 18 P001 0.9761309  0.75279619
## 19 P003 0.8089268  0.76735500
## 20 P005 0.7655857  0.14172428
## 21 P006 0.9018639  0.66236845
## 22 P004 0.8351172  0.84887690
## 23 P002 1.0295283  0.74715848
## 24 P008 0.7821590  0.51639406
Note that the rbind function expects that all data
frames being stacked have the same column names and it will line them up
automatically (the columns don’t have to be in the same order).
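Here’s a quick illustration with two toy data frames whose columns are in different orders:
df_a <- data.frame(x = 1, y = 2)
df_b <- data.frame(y = 4, x = 3)
print(rbind(df_a, df_b))
##   x y
## 1 1 2
## 2 3 4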
We named the individual tables task_pre, task_mid, and task_post, but in our stacked table we have no way of knowing whether an individual observation comes from the pre-, mid-, or post-session. Therefore we need to encode this in a new column, which we can call “session”:
task_pre$session <- "pre"
print(task_pre)
##     id   mean_rt ppn_correct session
## 1 P004 0.9940302   0.8553045     pre
## 2 P007 1.0177688   0.6808997     pre
## 3 P005 0.8747640   0.7188208     pre
## 4 P002 0.9370050   0.3346773     pre
## 5 P001 0.7810016   0.6032302     pre
## 6 P008 0.8966036   0.4963394     pre
## 7 P003 1.0374975   0.5489067     pre
## 8 P006 1.0125920   0.5978999     pre
task_mid$session <- "mid"
print(task_mid)
##     id   mean_rt ppn_correct session
## 1 P001 1.0808253  0.29404723     mid
## 2 P007 0.9053562  0.06180917     mid
## 3 P003 0.8050702  0.73247260     mid
## 4 P004 1.1213398  0.98107790     mid
## 5 P006 0.7467665  0.27127975     mid
## 6 P008 0.9261200  0.77565487     mid
## 7 P002 1.0827661  0.73563547     mid
## 8 P005 0.8617424  0.26777398     mid
task_post$session <- "post"
print(task_post)
##     id   mean_rt ppn_correct session
## 1 P007 0.8357853   0.5129933    post
## 2 P001 0.9761309   0.7527962    post
## 3 P003 0.8089268   0.7673550    post
## 4 P005 0.7655857   0.1417243    post
## 5 P006 0.9018639   0.6623684    post
## 6 P004 0.8351172   0.8488769    post
## 7 P002 1.0295283   0.7471585    post
## 8 P008 0.7821590   0.5163941    post
stacked <- rbind(task_pre, task_mid, task_post)
print(stacked)
##      id   mean_rt ppn_correct session
## 1  P004 0.9940302  0.85530451     pre
## 2  P007 1.0177688  0.68089972     pre
## 3  P005 0.8747640  0.71882082     pre
## 4  P002 0.9370050  0.33467732     pre
## 5  P001 0.7810016  0.60323019     pre
## 6  P008 0.8966036  0.49633942     pre
## 7  P003 1.0374975  0.54890673     pre
## 8  P006 1.0125920  0.59789987     pre
## 9  P001 1.0808253  0.29404723     mid
## 10 P007 0.9053562  0.06180917     mid
## 11 P003 0.8050702  0.73247260     mid
## 12 P004 1.1213398  0.98107790     mid
## 13 P006 0.7467665  0.27127975     mid
## 14 P008 0.9261200  0.77565487     mid
## 15 P002 1.0827661  0.73563547     mid
## 16 P005 0.8617424  0.26777398     mid
## 17 P007 0.8357853  0.51299325    post
## 18 P001 0.9761309  0.75279619    post
## 19 P003 0.8089268  0.76735500    post
## 20 P005 0.7655857  0.14172428    post
## 21 P006 0.9018639  0.66236845    post
## 22 P004 0.8351172  0.84887690    post
## 23 P002 1.0295283  0.74715848    post
## 24 P008 0.7821590  0.51639406    post
Note that even though we provided single strings ("pre", "mid", and "post") to create the new columns, those single strings were repeated (“recycled”) as many times as necessary to fill every row of the data frame. We can also provide a vector that has as many elements as there are rows in the data frame; in that case, the new column is populated with the elements of that vector, one element per row:
total_corr <- task_pre$ppn_correct * 30
print(total_corr)
## [1] 25.65914 20.42699 21.56462 10.04032 18.09691 14.89018 16.46720 17.93700
task_pre$total_correct <- total_corr
print(task_pre)
##     id   mean_rt ppn_correct session total_correct
## 1 P004 0.9940302   0.8553045     pre      25.65914
## 2 P007 1.0177688   0.6808997     pre      20.42699
## 3 P005 0.8747640   0.7188208     pre      21.56462
## 4 P002 0.9370050   0.3346773     pre      10.04032
## 5 P001 0.7810016   0.6032302     pre      18.09691
## 6 P008 0.8966036   0.4963394     pre      14.89018
## 7 P003 1.0374975   0.5489067     pre      16.46720
## 8 P006 1.0125920   0.5978999     pre      17.93700
Sometimes your data is spread across multiple spreadsheets (no pun
intended). For example, demographic data is stored in one spreadsheet,
task scores in another. Combining these spreadsheets is more complicated
than just copying and pasting them next to each other: you have to make
sure the rows line up properly using the merge
function.
To merge two data frames, you have to specify which columns contain the key that should be used to match up the rows. In this case, both the demographics and the task data have individual participants as their unit of observation, so we specify that the participant IDs should be used to line them up.
merged <- merge(demog, stacked, by = 'id')
print(merged)
##      id age   mean_rt ppn_correct session
## 1  P001  73 0.9761309  0.75279619    post
## 2  P001  73 0.7810016  0.60323019     pre
## 3  P001  73 1.0808253  0.29404723     mid
## 4  P002  73 0.9370050  0.33467732     pre
## 5  P002  73 1.0827661  0.73563547     mid
## 6  P002  73 1.0295283  0.74715848    post
## 7  P003  77 1.0374975  0.54890673     pre
## 8  P003  77 0.8050702  0.73247260     mid
## 9  P003  77 0.8089268  0.76735500    post
## 10 P004  71 0.9940302  0.85530451     pre
## 11 P004  71 0.8351172  0.84887690    post
## 12 P004  71 1.1213398  0.98107790     mid
## 13 P005  74 0.8747640  0.71882082     pre
## 14 P005  74 0.8617424  0.26777398     mid
## 15 P005  74 0.7655857  0.14172428    post
## 16 P006  71 0.7467665  0.27127975     mid
## 17 P006  71 0.9018639  0.66236845    post
## 18 P006  71 1.0125920  0.59789987     pre
## 19 P007  85 1.0177688  0.68089972     pre
## 20 P007  85 0.8357853  0.51299325    post
## 21 P007  85 0.9053562  0.06180917     mid
## 22 P008  74 0.8966036  0.49633942     pre
## 23 P008  74 0.9261200  0.77565487     mid
## 24 P008  74 0.7821590  0.51639406    post
Note that the merging was possible even though stacked
has more rows than demog.
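By default, merge keeps only rows whose key appears in both data frames (here, every id appears in both, so nothing was dropped). To keep unmatched rows too, with NAs filling the gaps, you could pass all = TRUE (or all.x / all.y):
merged_all <- merge(demog, stacked, by = 'id', all = TRUE)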
Note also that if you merge data frames that share a column name, and
you aren’t using that column name to merge them, the resulting data
frame will have 2 copies of that column called
<column-name>.x and
<column-name>.y.
demog$is_pilot <- FALSE
task_pre$is_pilot <- FALSE
merged_2 <- merge(demog, task_pre, by = 'id')
print(merged_2)
##     id age is_pilot.x   mean_rt ppn_correct session total_correct is_pilot.y
## 1 P001  73      FALSE 0.7810016   0.6032302     pre      18.09691      FALSE
## 2 P002  73      FALSE 0.9370050   0.3346773     pre      10.04032      FALSE
## 3 P003  77      FALSE 1.0374975   0.5489067     pre      16.46720      FALSE
## 4 P004  71      FALSE 0.9940302   0.8553045     pre      25.65914      FALSE
## 5 P005  74      FALSE 0.8747640   0.7188208     pre      21.56462      FALSE
## 6 P006  71      FALSE 1.0125920   0.5978999     pre      17.93700      FALSE
## 7 P007  85      FALSE 1.0177688   0.6808997     pre      20.42699      FALSE
## 8 P008  74      FALSE 0.8966036   0.4963394     pre      14.89018      FALSE
To avoid this, you could either use the redundant column to merge
(by = c('id', 'is_pilot')) or delete it from one of the
data frames.
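Deleting a column is done by assigning NULL to it. For example, to drop the duplicate from the merged result after the fact:
merged_2$is_pilot.y <- NULL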
You will often want to average over observations in your dataset. The base R function for this is called aggregate, although the popular dplyr library also provides group_by and summarize functions for the same purpose. aggregate takes a formula: the variable to the left of the ~ is the one being summarized, and the variable(s) to the right define the groups.
by_age <- aggregate(mean_rt ~ age, data = merged, FUN = median)
print(by_age)
##   age   mean_rt
## 1  71 0.9479471
## 2  73 1.0028296
## 3  74 0.8682532
## 4  77 0.8089268
## 5  85 0.9053562
by_session_and_age <- aggregate(ppn_correct ~ session + age, data = merged, FUN = mean)
print(by_session_and_age)
##    session age ppn_correct
## 1      mid  71  0.62617883
## 2     post  71  0.75562268
## 3      pre  71  0.72660219
## 4      mid  73  0.51484135
## 5     post  73  0.74997733
## 6      pre  73  0.46895376
## 7      mid  74  0.52171442
## 8     post  74  0.32905917
## 9      pre  74  0.60758012
## 10     mid  77  0.73247260
## 11    post  77  0.76735500
## 12     pre  77  0.54890673
## 13     mid  85  0.06180917
## 14    post  85  0.51299325
## 15     pre  85  0.68089972
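For reference, here’s what that last aggregation might look like with dplyr (assuming you have the package installed):
library(dplyr)
by_session_and_age_2 <- merged %>%
  group_by(session, age) %>%
  summarize(ppn_correct = mean(ppn_correct))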
Data can be described as (relatively) wide or long. In long data, the units of observation are relatively granular. For example, in our stacked dataset the unit of observation is the experimental session for a particular participant. As a result, long data contains relatively more rows. In wide data, the units of observation are relatively abstract or superordinate. For example, we might have a data frame where each row is a participant and the task scores for the pre- and post-sessions are stored in separate columns. As a result, wide data contains relatively more columns.
We can move between wide and long format using the
reshape function in base R (although, as with
aggregate, the popular library dplyr has
multiple replacements for this).
To make a data frame wider, we need to specify which column differentiates observations at the current level of observation (the timevar, in this case session) and which columns identify observations at the intended level of observation (the idvar, in this case id and age). All variables not in idvar or timevar are assumed to be measured at the current, more granular, level of observation.
wide <- reshape(merged, direction = 'wide',
                timevar = 'session',
                idvar = c('id', 'age')
)
print(names(wide))
## [1] "id"               "age"              "mean_rt.post"     "ppn_correct.post"
## [5] "mean_rt.pre"      "ppn_correct.pre"  "mean_rt.mid"      "ppn_correct.mid"
Note that the term is “time var” because repeated observations
(longitudinal data) are a paradigmatic case of long data, but the
reshape function could also be used to differentiate
observations from, e.g., different experimental conditions.
Wide data is often useful for computing difference scores:
diff_scores <- wide$mean_rt.post - wide$mean_rt.pre
hist(diff_scores)
t.test(diff_scores)
## 
##  One Sample t-test
## 
## data:  diff_scores
## t = -1.5029, df = 7, p-value = 0.1766
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.19820158  0.04416014
## sample estimates:
##   mean of x 
## -0.07702072
Reshaping from wide to long is less simple, because it’s based on
reading values from column names (e.g., pre from
mean_rt.pre). You have to provide the names of all the
columns containing measurements that vary by the new (more granular)
level of observation (varying) along with the name of the
new level of observation (timevar again).
Sometimes the column names in varying can be correctly
parsed by R automatically, but to be safe you can also provide the
specific names of the variables that were measured multiple times
(v.names) along with the possible values that
timevar can take on:
long <- reshape(wide, direction = 'long',
                varying = c('mean_rt.pre',
                            'mean_rt.mid',
                            'mean_rt.post',
                            'ppn_correct.pre',
                            'ppn_correct.mid',
                            'ppn_correct.post'),
                timevar = 'session',
                v.names = c('mean_rt', 'ppn_correct'),
                times = c('pre', 'mid', 'post'))
My general advice would be to use long format for long-term storage and reshape to wide as needed for analysis. This helps you avoid headaches like the one above.
In general, code should not be repetitive. If you find yourself copying and pasting the same piece of code and making small changes, this is a sign that you should use a loop instead. Loops allow you to iterate over a sequence of values and apply the same set of operations to each of them. That sounds very general because it is! Loops are very generally applicable. Let’s make it more concrete:
sequence_of_vals <- c(1, 2, 4, 8, 16)
for (current_val in sequence_of_vals) {
  # First operation: increment
  plus_one <- current_val + 1
  # Second operation: print
  print(plus_one)
}
## [1] 2
## [1] 3
## [1] 5
## [1] 9
## [1] 17
Here are some common research scenarios where you might use loops:
1. Repeating the same preprocessing or analysis steps for each of several experimental conditions or datasets
2. Reading in and combining data files from many participants or sessions
Some people might argue that you should use the apply
family of functions for these tasks (particularly #2), but I often find
it easiest to just write a loop.
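For instance, here’s a minimal sketch of scenario 2, re-reading the session files from earlier and stacking them with rbind as we go:
file_names <- c('task-pre.csv', 'task-mid.csv', 'task-post.csv')
all_sessions <- data.frame() # Start with an empty data frame
for (f in file_names) {
  df <- read.csv(f)
  df$source_file <- f # Record which file each row came from
  all_sessions <- rbind(all_sessions, df)
}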
In the last part of the workshop, we’ll do some live coding to compile data files from multiple participants and sessions. But first…
Suppose, for whatever reason, we want to re-stack the session-specific task data frames. Let’s scroll up to the stacking section and re-run the code there.
Why would we get an error running code that worked before?
It’s because we altered the task_pre table in the
section on merging
(task_pre$total_correct <- total_corr). When we go back
and re-run the old stacking code, it’s using the altered
task_pre data frame.
This is a mistake that’s very easy to make when you’re using R like Excel. Scrolling up to re-run old code creates the illusion that you’re turning back the clock, but in fact there is a single, current “environment” of variables in R (see the panel to the right) that determines what data your code is referring to at the time you run it. When you run old code in a new environment, all bets are off.
A spooky cautionary story:
You’re using R like Excel one day and you get an exciting result. Your supervisor isn’t around at the moment but you make a mental note to show them and you drop the command to generate the result into your script. You go on doing analysis until your supervisor walks by. You scroll up and re-run the command, but now it returns an error—probably an old-code-new-environment error. This means you have to re-run your whole script, but it’s 1000 lines long so that will be slow. Your supervisor tells you to call them back when you’re ready. You re-run the script, but your exciting result is still nowhere to be found—it must have been the result of an old-code-new-environment error too 😱
Operator precedence (order of operations, especially for Booleans)
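For example, the logical “or” operator | binds less tightly than ==, which leads to a classic mistake:
x <- 5
x == 1 | 2 # R reads this as (x == 1) | 2, and the 2 counts as TRUE
## [1] TRUE
x == 1 | x == 2 # What was actually meant
## [1] FALSE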