Some people use R like Excel: they open RStudio, load their workspace, and pick up where they left off last time, running commands interactively in the console (perhaps pulling these from a .R script that's used like a Word document to store useful commands). This may be you.

R should actually be used like Python (or Julia, JavaScript, C, or any other programming language): scripts are sets of instructions that should be run in sequence, in their entirety, to achieve a well-defined purpose (analysis_jan16.R is not a well-defined purpose), and the console should be used to debug scripts. Note that this document is not a .R script: it's an R notebook (.Rmd), which contains text (like this) alongside code.

The goal of this workshop will be to show you how and why to use R like Python. We’ll specifically focus on preprocessing data, which often turns out to be 80% of the work of data analysis.

Atomic data types

R (like other programming languages) has a few basic “atomic” data types. More complex pieces of data (like spreadsheets) are built up from these basic data types:

  1. Booleans: true/false values, represented by TRUE and FALSE
  2. Numbers
  3. Strings: text, enclosed by single or double quotes—e.g., 'hello' or "goodbye"

There are others, but these are the main 3 you’ll encounter.
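
If you're ever unsure which type a value is, the typeof function will tell you (note that Booleans are reported as "logical", numbers as "double", and strings as "character"):

typeof(TRUE)
## [1] "logical"
typeof(3.14)
## [1] "double"
typeof('hello')
## [1] "character"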

Variables

You can give a piece of data a name to refer back to it later—this is called assigning it to a variable. The arrow operator <- (which you can type in RStudio with Alt + hyphen, or Option + hyphen on a Mac) accomplishes this:

my_bool <- TRUE
print(my_bool)
## [1] TRUE
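
Assigning to the same name again simply overwrites the old value (keep this in mind for the aside near the end of this notebook):

my_bool <- FALSE
print(my_bool)
## [1] FALSE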

Data structures

Atomic pieces of data can be combined into more complex “data structures”. The main 3 data structures you’ll interact with are atomic vectors, lists, and data frames.

Atomic vectors

Atomic vectors are sequences of atomic pieces of data. They’re created with c(...) like so:

my_vector <- c(2, 4, 8, 16)

Note that all the elements in an atomic vector have to be of the same type. If you mix types, R silently coerces everything to the most flexible type present (logical → numeric → character):

bools_and_nums <- c(TRUE, FALSE, 1, 2, 3)
print(bools_and_nums)
## [1] 1 0 1 2 3
and_strings <- c(TRUE, FALSE, 1, 2, 3, 'a', 'b', 'c')
print(and_strings)
## [1] "TRUE"  "FALSE" "1"     "2"     "3"     "a"     "b"     "c"

To index (access) specific elements within a vector, you can use square brackets containing the indices of the elements you want (note that R indexing starts at 1, not 0):

print(my_vector)
## [1]  2  4  8 16
my_vector[1] # Single element
## [1] 2
my_vector[2:4] # Sequence of elements
## [1]  4  8 16
my_vector[c(1, 3)] # Only certain elements
## [1] 2 8
my_vector[-c(1, 3)] # All but certain elements
## [1]  4 16
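
Incidentally, the 2:4 used above is itself just a vector: the colon operator is shorthand for a sequence of consecutive integers.

print(2:4)
## [1] 2 3 4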

This is called “integer indexing”, but you can also do logical indexing. Here you create a vector of Booleans that’s TRUE for each value you want to access and FALSE for each value you want to ignore:

print(my_vector)
## [1]  2  4  8 16
idx <- my_vector < 5
print(idx)
## [1]  TRUE  TRUE FALSE FALSE
print(my_vector[idx])
## [1] 2 4
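
Logical conditions can be combined element-wise using & (and) and | (or):

idx_mid <- my_vector > 3 & my_vector < 10
print(my_vector[idx_mid])
## [1] 4 8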

Lists

Lists are like vectors (they are also sequences of data) but less constrained: their contents don’t have to be atomic and don’t have to be all the same type. Usually in a list, each element is associated with a different name:

my_list <- list(name1 = 'data 1',
                name2 = c(1, 2, 3))
print(my_list)
## $name1
## [1] "data 1"
## 
## $name2
## [1] 1 2 3

Then, to access an element of a list by name, you can use a dollar sign:

print(my_list$name1)
## [1] "data 1"
print(my_list$name2)
## [1] 1 2 3
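
You can also index a list with double square brackets, using either the element's name or its position (handy when the name is stored in a variable rather than typed out):

print(my_list[['name2']])
## [1] 1 2 3
print(my_list[[1]])
## [1] "data 1"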

Data frames

Data frames are a very important data structure because they’re used to store data read from a spreadsheet. A data frame is like a list of atomic vectors: you access each column the way you’d access elements in a list, and each column is an atomic vector.

my_df <- read.csv('example-data.csv')
print(my_df)
##            rt choice cond
## 1   0.9110855    old    a
## 2   8.4787557    new    b
## 3   8.1044009    new    a
## 4  12.2404803    old    b
## 5   4.3951396    old    a
## 6  13.3365931    old    b
## 7  19.4296267    old    a
## 8   4.8182342    new    b
## 9   3.4131416    old    a
## 10  0.1668238    new    b
print(my_df$rt) # dollar sign
##  [1]  0.9110855  8.4787557  8.1044009 12.2404803  4.3951396 13.3365931
##  [7] 19.4296267  4.8182342  3.4131416  0.1668238
print(my_df$rt[2:4]) # dollar sign and square brackets
## [1]  8.478756  8.104401 12.240480

You can access sections of data frames using a special square brackets notation, where you give first the rows you want (using any index that would work for vectors) followed by the columns you want (as a vector of strings):

row_idx <- c(1, 3)
col_idx <- c('rt', 'cond')
sub_df <- my_df[row_idx, col_idx]
print(sub_df)
##          rt cond
## 1 0.9110855    a
## 3 8.1044009    a

To access all rows for a given subset of columns, just omit the row index (and vice versa), but keep the comma:

print(my_df[, col_idx]) 
##            rt cond
## 1   0.9110855    a
## 2   8.4787557    b
## 3   8.1044009    a
## 4  12.2404803    b
## 5   4.3951396    a
## 6  13.3365931    b
## 7  19.4296267    a
## 8   4.8182342    b
## 9   3.4131416    a
## 10  0.1668238    b
print(my_df[row_idx, ])
##          rt choice cond
## 1 0.9110855    old    a
## 3 8.1044009    new    a
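
A few functions are handy for checking a data frame's overall shape:

nrow(my_df) # Number of rows
## [1] 10
ncol(my_df) # Number of columns
## [1] 3
names(my_df) # Column names
## [1] "rt"     "choice" "cond"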

More about data frames

“Tidy data” and units/levels of observation

Spreadsheets should be organized according to the tidy data standard, where each row corresponds to a different observation (at the spreadsheet’s level of observation) and each column corresponds to a different measured variable. Typical examples:

  • Each row is a different participant. There are columns for age, gender, etc.

  • Each row is a different trial from an experiment. There are columns for RT, response, condition, etc.

  • Each row is a different (within-subjects) experimental condition for a different participant. There are columns for things that are specific to the participant (e.g., age) and columns for things that are specific to the experimental condition for that participant (e.g., mean RT).

It’s important to be clear about what the unit/level of observation is for a given data frame: in the first example, individual participants are the unit of observation, whereas the second describes trial-level data. Understanding how your data is organized helps you think clearly about it (e.g., avoiding mistakes like the one described in this legendary thread).

Storage formats

In general, it’s best to store any spreadsheet that you might want to read with R as a .csv file rather than a .xlsx file. CSVs are simple text files where side-by-side cells are separated by commas and each row is on a different line. CSVs have 2 main advantages:

  • They can be opened with any software—you don’t need Excel, let alone a specific version of Excel, nor a special R library for reading Excel files

  • Excel won’t let you save CSVs with things like coloured cells or formulas or anything else that R won’t know how to read (Excel might give you scary warnings that some data will be lost if you use CSV format—be brave)

Our example data

To learn how to manipulate data frames, I’ve generated some example data for this workshop. Suppose we’ve run an intervention and collected performance on some memory task at 3 time points. We’ve also collected demographic info. Let’s read the files into R:

demog <- read.csv('demographics.csv')
task_pre <- read.csv('task-pre.csv')
task_mid <- read.csv('task-mid.csv')
task_post <- read.csv('task-post.csv')
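
It's a good habit to check what you've just loaded, for example by printing the column names:

print(names(task_pre))
## [1] "id"          "mean_rt"     "ppn_correct"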

Data frame operations

Subsetting

As we saw earlier, we can select a subset of rows from a data frame using square bracket notation. R also has a convenient subset function that allows us to do this a little more succinctly when we want to filter out rows that don’t match some criterion:

# Method 1: square brackets
slow_rt_idx <- task_pre$mean_rt > 1
print(slow_rt_idx)
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
print(task_pre[slow_rt_idx, ])
##     id  mean_rt ppn_correct
## 2 P007 1.017769   0.6808997
## 7 P003 1.037498   0.5489067
## 8 P006 1.012592   0.5978999
# Method 2: subset function
print(subset(task_pre, mean_rt > 1))
##     id  mean_rt ppn_correct
## 2 P007 1.017769   0.6808997
## 7 P003 1.037498   0.5489067
## 8 P006 1.012592   0.5978999

Note that the subset function can only do logical indexing. It isn’t used to, for example, access the first row of a data frame.
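
For that kind of positional access, you'd use square brackets:

print(task_pre[1, ]) # First row
##     id   mean_rt ppn_correct
## 1 P004 0.9940302   0.8553045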

Stacking

One way to combine data frames is to stack them on top of one another using the rbind function. For example, suppose we want to combine pre-, mid-, and post-session task data this way:

stacked <- rbind(task_pre, task_mid, task_post)
print(stacked)
##      id   mean_rt ppn_correct
## 1  P004 0.9940302  0.85530451
## 2  P007 1.0177688  0.68089972
## 3  P005 0.8747640  0.71882082
## 4  P002 0.9370050  0.33467732
## 5  P001 0.7810016  0.60323019
## 6  P008 0.8966036  0.49633942
## 7  P003 1.0374975  0.54890673
## 8  P006 1.0125920  0.59789987
## 9  P001 1.0808253  0.29404723
## 10 P007 0.9053562  0.06180917
## 11 P003 0.8050702  0.73247260
## 12 P004 1.1213398  0.98107790
## 13 P006 0.7467665  0.27127975
## 14 P008 0.9261200  0.77565487
## 15 P002 1.0827661  0.73563547
## 16 P005 0.8617424  0.26777398
## 17 P007 0.8357853  0.51299325
## 18 P001 0.9761309  0.75279619
## 19 P003 0.8089268  0.76735500
## 20 P005 0.7655857  0.14172428
## 21 P006 0.9018639  0.66236845
## 22 P004 0.8351172  0.84887690
## 23 P002 1.0295283  0.74715848
## 24 P008 0.7821590  0.51639406

Note that rbind expects all the data frames being stacked to have the same column names; it lines the columns up by name, so they don’t have to be in the same order.

We named the individual tables task_pre, task_mid, and task_post, but in our stacked table we have no way of knowing whether an individual observation comes from the pre-, mid-, or post-session. We therefore need to encode this in a new column, which we can call “session”:

task_pre$session <- "pre"
print(task_pre)
##     id   mean_rt ppn_correct session
## 1 P004 0.9940302   0.8553045     pre
## 2 P007 1.0177688   0.6808997     pre
## 3 P005 0.8747640   0.7188208     pre
## 4 P002 0.9370050   0.3346773     pre
## 5 P001 0.7810016   0.6032302     pre
## 6 P008 0.8966036   0.4963394     pre
## 7 P003 1.0374975   0.5489067     pre
## 8 P006 1.0125920   0.5978999     pre
task_mid$session <- "mid"
print(task_mid)
##     id   mean_rt ppn_correct session
## 1 P001 1.0808253  0.29404723     mid
## 2 P007 0.9053562  0.06180917     mid
## 3 P003 0.8050702  0.73247260     mid
## 4 P004 1.1213398  0.98107790     mid
## 5 P006 0.7467665  0.27127975     mid
## 6 P008 0.9261200  0.77565487     mid
## 7 P002 1.0827661  0.73563547     mid
## 8 P005 0.8617424  0.26777398     mid
task_post$session <- "post"
print(task_post)
##     id   mean_rt ppn_correct session
## 1 P007 0.8357853   0.5129933    post
## 2 P001 0.9761309   0.7527962    post
## 3 P003 0.8089268   0.7673550    post
## 4 P005 0.7655857   0.1417243    post
## 5 P006 0.9018639   0.6623684    post
## 6 P004 0.8351172   0.8488769    post
## 7 P002 1.0295283   0.7471585    post
## 8 P008 0.7821590   0.5163941    post
stacked <- rbind(task_pre, task_mid, task_post)
print(stacked)
##      id   mean_rt ppn_correct session
## 1  P004 0.9940302  0.85530451     pre
## 2  P007 1.0177688  0.68089972     pre
## 3  P005 0.8747640  0.71882082     pre
## 4  P002 0.9370050  0.33467732     pre
## 5  P001 0.7810016  0.60323019     pre
## 6  P008 0.8966036  0.49633942     pre
## 7  P003 1.0374975  0.54890673     pre
## 8  P006 1.0125920  0.59789987     pre
## 9  P001 1.0808253  0.29404723     mid
## 10 P007 0.9053562  0.06180917     mid
## 11 P003 0.8050702  0.73247260     mid
## 12 P004 1.1213398  0.98107790     mid
## 13 P006 0.7467665  0.27127975     mid
## 14 P008 0.9261200  0.77565487     mid
## 15 P002 1.0827661  0.73563547     mid
## 16 P005 0.8617424  0.26777398     mid
## 17 P007 0.8357853  0.51299325    post
## 18 P001 0.9761309  0.75279619    post
## 19 P003 0.8089268  0.76735500    post
## 20 P005 0.7655857  0.14172428    post
## 21 P006 0.9018639  0.66236845    post
## 22 P004 0.8351172  0.84887690    post
## 23 P002 1.0295283  0.74715848    post
## 24 P008 0.7821590  0.51639406    post

Note that even though we provided single strings ("pre", "mid", and "post") to create the new columns, each string was recycled (repeated) as many times as necessary to fill every row of the data frame. We can also provide a vector with as many elements as there are rows in the data frame; then the new column is populated with the elements of that vector, one element per row:

total_corr <- task_pre$ppn_correct * 30
print(total_corr)
## [1] 25.65914 20.42699 21.56462 10.04032 18.09691 14.89018 16.46720 17.93700
task_pre$total_correct <- total_corr
print(task_pre)
##     id   mean_rt ppn_correct session total_correct
## 1 P004 0.9940302   0.8553045     pre      25.65914
## 2 P007 1.0177688   0.6808997     pre      20.42699
## 3 P005 0.8747640   0.7188208     pre      21.56462
## 4 P002 0.9370050   0.3346773     pre      10.04032
## 5 P001 0.7810016   0.6032302     pre      18.09691
## 6 P008 0.8966036   0.4963394     pre      14.89018
## 7 P003 1.0374975   0.5489067     pre      16.46720
## 8 P006 1.0125920   0.5978999     pre      17.93700

Merging

Sometimes your data is spread across multiple spreadsheets (no pun intended). For example, demographic data is stored in one spreadsheet and task scores in another. Combining these spreadsheets is more complicated than just copying and pasting them next to each other: you have to make sure the rows line up properly, which is what the merge function is for.

To merge two data frames, you have to specify which columns contain the key that should be used to match up the rows. In this case, both data frames identify participants by ID (demog has one row per participant, while stacked has one row per participant per session), so we specify that the participant IDs should be used to line them up.

merged <- merge(demog, stacked, by = 'id')
print(merged)
##      id age   mean_rt ppn_correct session
## 1  P001  73 0.9761309  0.75279619    post
## 2  P001  73 0.7810016  0.60323019     pre
## 3  P001  73 1.0808253  0.29404723     mid
## 4  P002  73 0.9370050  0.33467732     pre
## 5  P002  73 1.0827661  0.73563547     mid
## 6  P002  73 1.0295283  0.74715848    post
## 7  P003  77 1.0374975  0.54890673     pre
## 8  P003  77 0.8050702  0.73247260     mid
## 9  P003  77 0.8089268  0.76735500    post
## 10 P004  71 0.9940302  0.85530451     pre
## 11 P004  71 0.8351172  0.84887690    post
## 12 P004  71 1.1213398  0.98107790     mid
## 13 P005  74 0.8747640  0.71882082     pre
## 14 P005  74 0.8617424  0.26777398     mid
## 15 P005  74 0.7655857  0.14172428    post
## 16 P006  71 0.7467665  0.27127975     mid
## 17 P006  71 0.9018639  0.66236845    post
## 18 P006  71 1.0125920  0.59789987     pre
## 19 P007  85 1.0177688  0.68089972     pre
## 20 P007  85 0.8357853  0.51299325    post
## 21 P007  85 0.9053562  0.06180917     mid
## 22 P008  74 0.8966036  0.49633942     pre
## 23 P008  74 0.9261200  0.77565487     mid
## 24 P008  74 0.7821590  0.51639406    post

Note that the merging was possible even though stacked has more rows than demog: each row of demog simply gets matched to every row of stacked with the same id.

Note also that if you merge data frames that share a column name, and you aren’t using that column name to merge them, the resulting data frame will have 2 copies of that column called <column-name>.x and <column-name>.y.

demog$is_pilot <- FALSE
task_pre$is_pilot <- FALSE
merged_2 <- merge(demog, task_pre, by = 'id')
print(merged_2)
##     id age is_pilot.x   mean_rt ppn_correct session total_correct is_pilot.y
## 1 P001  73      FALSE 0.7810016   0.6032302     pre      18.09691      FALSE
## 2 P002  73      FALSE 0.9370050   0.3346773     pre      10.04032      FALSE
## 3 P003  77      FALSE 1.0374975   0.5489067     pre      16.46720      FALSE
## 4 P004  71      FALSE 0.9940302   0.8553045     pre      25.65914      FALSE
## 5 P005  74      FALSE 0.8747640   0.7188208     pre      21.56462      FALSE
## 6 P006  71      FALSE 1.0125920   0.5978999     pre      17.93700      FALSE
## 7 P007  85      FALSE 1.0177688   0.6808997     pre      20.42699      FALSE
## 8 P008  74      FALSE 0.8966036   0.4963394     pre      14.89018      FALSE

To avoid this, you could either use the redundant column to merge (by = c('id', 'is_pilot')) or delete it from one of the data frames.
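
Deleting a column is done by assigning NULL to it:

task_pre$is_pilot <- NULL
demog$is_pilot <- NULL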

Aggregating

You will often want to average over observations in your dataset. The base R function for this is called aggregate, although there are also the popular group_by and summarize functions provided by the dplyr library.

by_age <- aggregate(mean_rt ~ age, data = merged, FUN = median)
print(by_age)
##   age   mean_rt
## 1  71 0.9479471
## 2  73 1.0028296
## 3  74 0.8682532
## 4  77 0.8089268
## 5  85 0.9053562
by_session_and_age <- aggregate(ppn_correct ~ session + age, data = merged, FUN = mean)
print(by_session_and_age)
##    session age ppn_correct
## 1      mid  71  0.62617883
## 2     post  71  0.75562268
## 3      pre  71  0.72660219
## 4      mid  73  0.51484135
## 5     post  73  0.74997733
## 6      pre  73  0.46895376
## 7      mid  74  0.52171442
## 8     post  74  0.32905917
## 9      pre  74  0.60758012
## 10     mid  77  0.73247260
## 11    post  77  0.76735500
## 12     pre  77  0.54890673
## 13     mid  85  0.06180917
## 14    post  85  0.51299325
## 15     pre  85  0.68089972
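
For comparison, here's roughly what that last aggregation would look like with dplyr (a sketch, not run here; it assumes you have the dplyr package installed, and its output is formatted a little differently):

library(dplyr)
merged %>%
  group_by(session, age) %>%
  summarize(ppn_correct = mean(ppn_correct))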

Reshaping long to wide

Data can be described as (relatively) wide or long. In long data, the units of observation are relatively granular. For example, in our stacked dataset the unit of observation is the experimental session for a particular participant. As a result, long data contains relatively more rows. In wide data, the units of observation are relatively abstract or superordinate. For example, we might have a data frame where each row is a participant and the task scores for the pre- and post-sessions are stored in separate columns. As a result, wide data contains relatively more columns.

We can move between wide and long format using the reshape function in base R (although, as with aggregate, the popular library dplyr has multiple replacements for this).

To make a data frame wider, we need to specify which column differentiates observations at the current, more granular, level of observation (the timevar, in this case session) and which columns identify units at the intended, wider, level of observation (the idvar, in this case id and age). All remaining variables are assumed to be measured at the current, more granular, level.

wide <- reshape(merged, direction = 'wide',
                timevar = 'session',
                idvar = c('id', 'age')
)
print(names(wide))
## [1] "id"               "age"              "mean_rt.post"     "ppn_correct.post"
## [5] "mean_rt.pre"      "ppn_correct.pre"  "mean_rt.mid"      "ppn_correct.mid"

Note that the term is “time var” because repeated observations (longitudinal data) are a paradigmatic case of long data, but the reshape function could also be used to differentiate observations from, e.g., different experimental conditions.

Wide data is often useful for computing difference scores:

diff_scores <- wide$mean_rt.post - wide$mean_rt.pre
hist(diff_scores)

t.test(diff_scores)
## 
##  One Sample t-test
## 
## data:  diff_scores
## t = -1.5029, df = 7, p-value = 0.1766
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.19820158  0.04416014
## sample estimates:
##   mean of x 
## -0.07702072

Reshaping wide to long

Reshaping from wide to long is less simple, because it’s based on reading values from column names (e.g., pre from mean_rt.pre). You have to provide the names of all the columns containing measurements that vary by the new (more granular) level of observation (varying) along with the name of the new level of observation (timevar again).

Sometimes the column names in varying can be correctly parsed by R automatically, but to be safe you can also provide the specific names of the variables that were measured multiple times (v.names) along with the possible values that timevar can take on:

long <- reshape(wide, direction = 'long',
                varying = c('mean_rt.pre',
                            'mean_rt.mid',
                            'mean_rt.post',
                            'ppn_correct.pre',
                            'ppn_correct.mid',
                            'ppn_correct.post'),
                timevar = 'session',
                v.names = c('mean_rt', 'ppn_correct'),
                times = c('pre', 'mid', 'post'))
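
As a sanity check, we should be back to 24 rows (8 participants over 3 sessions) and 5 columns (id, age, session, mean_rt, ppn_correct):

print(dim(long))
## [1] 24  5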

My general advice would be to use long format for long-term storage and reshape to wide as needed for analysis. This helps you avoid headaches like the above.

Loops

In general, code should not be repetitive. If you find yourself copying and pasting the same piece of code and making small changes, this is a sign that you should use a loop instead. Loops allow you to iterate over a sequence of values and apply the same set of operations to each of them. That sounds very general because it is! Loops are very generally applicable. Let’s make it more concrete:

sequence_of_vals <- c(1, 2, 4, 8, 16)
for (current_val in sequence_of_vals) {
  # First operation: increment
  plus_one <- current_val + 1
  # Second operation: print
  print(plus_one)
}
## [1] 2
## [1] 3
## [1] 5
## [1] 9
## [1] 17

Here are some common research scenarios where you might use loops:

  1. Doing the same preprocessing on many different participant- and/or task-specific data files
  2. Computing the same summary statistics for many different outcome variables
  3. Fitting the same statistical model with many different outcome/predictor variables

Some people might argue that you should use the apply family of functions for these tasks (particularly #2), but I often find it easiest to just write a loop.
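
As a small taste of scenario 1, here's a sketch that rebuilds our stacked table by looping over session names (using the file names from our example data):

session_dfs <- list()
for (s in c('pre', 'mid', 'post')) {
  # Read this session's file (e.g., 'task-pre.csv')
  df <- read.csv(paste0('task-', s, '.csv'))
  # Record which session each row came from
  df$session <- s
  session_dfs[[s]] <- df
}
# Stack all the session data frames in one go
stacked_again <- do.call(rbind, session_dfs)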

In the last part of the workshop, we’ll do some live coding to compile data files from multiple participants and sessions. But first…

An important aside

Suppose, for whatever reason, we want to re-stack the session-specific task data frames. Let’s scroll up to the stacking section and re-run the code there.

Why would we get an error running code that worked before?

It’s because we later altered the task_pre table (task_pre$total_correct <- total_corr), so its columns no longer match those of task_mid and task_post. When we scroll up and re-run the old stacking code, it uses the altered task_pre data frame, and rbind fails.

This is a mistake that’s very easy to make when you’re using R like Excel. Scrolling up to re-run old code creates the illusion that you’re turning back the clock, but in fact there is a single, current “environment” of variables in R (see the panel to the right) that determines what data your code is referring to at the time you run it. When you run old code in a new environment, all bets are off.

A spooky cautionary story:

You’re using R like Excel one day and you get an exciting result. Your supervisor isn’t around at the moment but you make a mental note to show them and you drop the command to generate the result into your script. You go on doing analysis until your supervisor walks by. You scroll up and re-run the command, but now it returns an error—probably an old-code-new-environment error. This means you have to re-run your whole script, but it’s 1000 lines long so that will be slow. Your supervisor tells you to call them back when you’re ready. You re-run the script, but your exciting result is still nowhere to be found—it must have been the result of an old-code-new-environment error too 😱

Topics for further reading