Some people use R like Excel: they open RStudio, load their workspace, and pick up where they left off last time, running commands interactively in the console (perhaps getting these from a .R script that’s used like a Word document to store useful commands). This may be you if:

- You save your workspace image when you close RStudio
- Your scripts are many hundreds or even thousands of lines long
- Your data files have names like `data_jan16`, `data_jan17`, `data_cleaned_jan17`, etc.
- You sometimes get different results when you re-run the same command (!)
R should actually be used like Python (or Julia, JavaScript, C, or any other programming language): scripts are sets of instructions that should be run in sequence in their entirety to achieve a well-defined purpose (`analysis_jan16.R` is not a well-defined purpose), and the console should be used to debug scripts.

Note that this is not a .R script. It’s an R notebook (.Rmd), which contains text (like this) alongside code.
The goal of this workshop will be to show you how and why to use R like Python. We’ll specifically focus on preprocessing data, which often turns out to be 80% of the work of data analysis.
R (like other programming languages) has a few basic “atomic” data types. More complex pieces of data (like spreadsheets) are built up from these basic data types:

- Booleans (logical values): `TRUE` and `FALSE`
- Numbers: e.g., `3` or `2.5`
- Strings: e.g., `'hello'` or `"goodbye"`

There are others, but these are the main 3 you’ll encounter.
You can give a piece of data a name to refer back to it later—this is called assigning it to a variable. The arrow operator `<-` (which you can write using alt + hyphen) accomplishes this:
my_bool <- TRUE
print(my_bool)
## [1] TRUE
Atomic pieces of data can be combined into more complex “data structures”. The main 3 data structures you’ll interact with are atomic vectors, lists, and data frames.
Atomic vectors are sequences of atomic pieces of data. They’re created with `c(...)` like so:
my_vector <- c(2, 4, 8, 16)
Note that all the elements in an atomic vector have to be of the same type. If you mix types, R silently converts (“coerces”) everything to the most flexible type present (logical → numeric → character):
bools_and_nums <- c(TRUE, FALSE, 1, 2, 3)
print(bools_and_nums)
## [1] 1 0 1 2 3
and_strings <- c(TRUE, FALSE, 1, 2, 3, 'a', 'b', 'c')
print(and_strings)
## [1] "TRUE" "FALSE" "1" "2" "3" "a" "b" "c"
To index (access) specific elements within a vector, you can use square brackets containing the indices of the elements you want:
print(my_vector)
## [1] 2 4 8 16
my_vector[1] # Single element
## [1] 2
my_vector[2:4] # Sequence of elements
## [1] 4 8 16
my_vector[c(1, 3)] # Only certain elements
## [1] 2 8
my_vector[-c(1, 3)] # All but certain elements
## [1] 4 16
This is called “integer indexing”, but you can also do logical indexing. Here you create a vector of Booleans that’s TRUE for each value you want to access and FALSE for each value you want to ignore:
print(my_vector)
## [1] 2 4 8 16
idx <- my_vector < 5
print(idx)
## [1] TRUE TRUE FALSE FALSE
print(my_vector[idx])
## [1] 2 4
Lists are like vectors (they are also sequences of data) but less constrained: their contents don’t have to be atomic and don’t have to be all the same type. Usually in a list, each element is associated with a different name:
my_list <- list(name1 = 'data 1',
                name2 = c(1, 2, 3))
print(my_list)
## $name1
## [1] "data 1"
##
## $name2
## [1] 1 2 3
Then, to access an element of a list by name, you can use a dollar sign:
print(my_list$name1)
## [1] "data 1"
print(my_list$name2)
## [1] 1 2 3
Data frames are a very important data structure because they’re used to store data read from a spreadsheet. A data frame is like a list of atomic vectors: you access each column the way you’d access elements in a list, and each column is an atomic vector.
my_df <- read.csv('example-data.csv')
print(my_df)
## rt choice cond
## 1 0.9110855 old a
## 2 8.4787557 new b
## 3 8.1044009 new a
## 4 12.2404803 old b
## 5 4.3951396 old a
## 6 13.3365931 old b
## 7 19.4296267 old a
## 8 4.8182342 new b
## 9 3.4131416 old a
## 10 0.1668238 new b
print(my_df$rt) # dollar sign
## [1] 0.9110855 8.4787557 8.1044009 12.2404803 4.3951396 13.3365931
## [7] 19.4296267 4.8182342 3.4131416 0.1668238
print(my_df$rt[2:4]) # dollar sign and square brackets
## [1] 8.478756 8.104401 12.240480
You can access sections of data frames using a special square brackets notation, where you first give the rows you want (using any index that would work for vectors), followed by the columns you want (as a vector of strings):
row_idx <- c(1, 3)
col_idx <- c('rt', 'cond')
sub_df <- my_df[row_idx, col_idx]
print(sub_df)
## rt cond
## 1 0.9110855 a
## 3 8.1044009 a
To access all rows for a given subset of columns, just omit the row index (and vice versa), but keep the comma:
print(my_df[, col_idx])
## rt cond
## 1 0.9110855 a
## 2 8.4787557 b
## 3 8.1044009 a
## 4 12.2404803 b
## 5 4.3951396 a
## 6 13.3365931 b
## 7 19.4296267 a
## 8 4.8182342 b
## 9 3.4131416 a
## 10 0.1668238 b
print(my_df[row_idx, ])
## rt choice cond
## 1 0.9110855 old a
## 3 8.1044009 new a
Spreadsheets should be organized according to the tidy data standard, where each row corresponds to a different observation (at the spreadsheet’s level of observation) and each column corresponds to a different measured variable. Typical examples:

- Each row is a different participant. There are columns for age, gender, etc.
- Each row is a different trial from an experiment. There are columns for RT, response, condition, etc.
- Each row is a different (within-subjects) experimental condition for a different participant. There are columns for things that are specific to the participant (e.g., age) and columns for things that are specific to the experimental condition for that participant (e.g., mean RT).
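To make the second example concrete, here’s a minimal sketch of a tidy trial-level data frame (the values are made up for illustration):

tidy_trials <- data.frame(id = c('P001', 'P001', 'P002', 'P002'),
                          rt = c(0.52, 0.61, 0.48, 0.75),
                          response = c('old', 'new', 'old', 'old'),
                          cond = c('a', 'b', 'a', 'b'))

Each row is one trial, and each column is one variable measured on that trial.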
It’s important to be clear about what the unit/level of observation is for a given data frame: in the first example, individual participants are the unit of observation, whereas the second describes trial-level data. Understanding how your data is organized helps you think clearly about it (e.g., avoiding mistakes like the one described in this legendary thread).
In general, it’s best to store any spreadsheet that you might want to read with R in .csv files rather than .xlsx files. CSVs are simple text files where side-by-side cells are separated by commas and each row is on a different line. CSVs have 2 main advantages:

- They can be opened with any software—you don’t need Excel, let alone a specific version of Excel, nor a special R library for reading Excel files
- Excel won’t let you save CSVs with things like coloured cells or formulas or anything else that R won’t know how to read (Excel might give you scary warnings that some data will be lost if you use CSV format—be brave)
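As an aside, writing a data frame out as a CSV from R is a one-liner. A minimal sketch (the file name here is just an example I’ve picked):

write.csv(my_df, 'example-data-copy.csv', row.names = FALSE) # row.names = FALSE skips writing a redundant index column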
To learn how to manipulate data frames, I’ve generated some example data for this workshop. Suppose we’ve run an intervention and collected performance on some memory task at 3 time points. We’ve also collected demographic info. Let’s read the files into R:
demog <- read.csv('demographics.csv')
task_pre <- read.csv('task-pre.csv')
task_mid <- read.csv('task-mid.csv')
task_post <- read.csv('task-post.csv')
As we saw earlier, we can select a subset of rows from a data frame using square bracket notation. R also has a convenient `subset` function that allows us to do this a little more succinctly when we want to filter out rows that don’t match some criterion:
# Method 1: square brackets
slow_rt_idx <- task_pre$mean_rt > 1
print(slow_rt_idx)
## [1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
print(task_pre[slow_rt_idx, ])
## id mean_rt ppn_correct
## 2 P007 1.017769 0.6808997
## 7 P003 1.037498 0.5489067
## 8 P006 1.012592 0.5978999
# Method 2: subset function
print(subset(task_pre, mean_rt > 1))
## id mean_rt ppn_correct
## 2 P007 1.017769 0.6808997
## 7 P003 1.037498 0.5489067
## 8 P006 1.012592 0.5978999
Note that the `subset` function can only do logical indexing. It isn’t used to, for example, access the first row of a data frame.
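If you do want that kind of positional access, plain square brackets still work. For example, to get the first row with all columns:

task_pre[1, ] # first row, all columns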
One way to combine data frames is to stack them on top of one another using the `rbind` function. For example, suppose we want to combine pre-, mid-, and post-session task data this way:
stacked <- rbind(task_pre, task_mid, task_post)
print(stacked)
## id mean_rt ppn_correct
## 1 P004 0.9940302 0.85530451
## 2 P007 1.0177688 0.68089972
## 3 P005 0.8747640 0.71882082
## 4 P002 0.9370050 0.33467732
## 5 P001 0.7810016 0.60323019
## 6 P008 0.8966036 0.49633942
## 7 P003 1.0374975 0.54890673
## 8 P006 1.0125920 0.59789987
## 9 P001 1.0808253 0.29404723
## 10 P007 0.9053562 0.06180917
## 11 P003 0.8050702 0.73247260
## 12 P004 1.1213398 0.98107790
## 13 P006 0.7467665 0.27127975
## 14 P008 0.9261200 0.77565487
## 15 P002 1.0827661 0.73563547
## 16 P005 0.8617424 0.26777398
## 17 P007 0.8357853 0.51299325
## 18 P001 0.9761309 0.75279619
## 19 P003 0.8089268 0.76735500
## 20 P005 0.7655857 0.14172428
## 21 P006 0.9018639 0.66236845
## 22 P004 0.8351172 0.84887690
## 23 P002 1.0295283 0.74715848
## 24 P008 0.7821590 0.51639406
Note that the `rbind` function expects that all data frames being stacked have the same column names, and it will line them up automatically (the columns don’t have to be in the same order).
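Here’s a tiny sketch of that alignment, using two made-up data frames (`df_a` and `df_b`):

df_a <- data.frame(x = 1, y = 'a')
df_b <- data.frame(y = 'b', x = 2) # same columns, different order
print(rbind(df_a, df_b))
##   x y
## 1 1 a
## 2 2 b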
We named the individual tables `task_pre`, `task_mid`, and `task_post`, but in our stacked table we have no way of knowing whether an individual observation comes from the pre-, mid-, or post-session. Therefore we need to encode this in a new column, which we can call “session”:
task_pre$session <- "pre"
print(task_pre)
## id mean_rt ppn_correct session
## 1 P004 0.9940302 0.8553045 pre
## 2 P007 1.0177688 0.6808997 pre
## 3 P005 0.8747640 0.7188208 pre
## 4 P002 0.9370050 0.3346773 pre
## 5 P001 0.7810016 0.6032302 pre
## 6 P008 0.8966036 0.4963394 pre
## 7 P003 1.0374975 0.5489067 pre
## 8 P006 1.0125920 0.5978999 pre
task_mid$session <- "mid"
print(task_mid)
## id mean_rt ppn_correct session
## 1 P001 1.0808253 0.29404723 mid
## 2 P007 0.9053562 0.06180917 mid
## 3 P003 0.8050702 0.73247260 mid
## 4 P004 1.1213398 0.98107790 mid
## 5 P006 0.7467665 0.27127975 mid
## 6 P008 0.9261200 0.77565487 mid
## 7 P002 1.0827661 0.73563547 mid
## 8 P005 0.8617424 0.26777398 mid
task_post$session <- "post"
print(task_post)
## id mean_rt ppn_correct session
## 1 P007 0.8357853 0.5129933 post
## 2 P001 0.9761309 0.7527962 post
## 3 P003 0.8089268 0.7673550 post
## 4 P005 0.7655857 0.1417243 post
## 5 P006 0.9018639 0.6623684 post
## 6 P004 0.8351172 0.8488769 post
## 7 P002 1.0295283 0.7471585 post
## 8 P008 0.7821590 0.5163941 post
stacked <- rbind(task_pre, task_mid, task_post)
print(stacked)
## id mean_rt ppn_correct session
## 1 P004 0.9940302 0.85530451 pre
## 2 P007 1.0177688 0.68089972 pre
## 3 P005 0.8747640 0.71882082 pre
## 4 P002 0.9370050 0.33467732 pre
## 5 P001 0.7810016 0.60323019 pre
## 6 P008 0.8966036 0.49633942 pre
## 7 P003 1.0374975 0.54890673 pre
## 8 P006 1.0125920 0.59789987 pre
## 9 P001 1.0808253 0.29404723 mid
## 10 P007 0.9053562 0.06180917 mid
## 11 P003 0.8050702 0.73247260 mid
## 12 P004 1.1213398 0.98107790 mid
## 13 P006 0.7467665 0.27127975 mid
## 14 P008 0.9261200 0.77565487 mid
## 15 P002 1.0827661 0.73563547 mid
## 16 P005 0.8617424 0.26777398 mid
## 17 P007 0.8357853 0.51299325 post
## 18 P001 0.9761309 0.75279619 post
## 19 P003 0.8089268 0.76735500 post
## 20 P005 0.7655857 0.14172428 post
## 21 P006 0.9018639 0.66236845 post
## 22 P004 0.8351172 0.84887690 post
## 23 P002 1.0295283 0.74715848 post
## 24 P008 0.7821590 0.51639406 post
Note that even though we provided single strings (`"pre"`, `"mid"`, and `"post"`) to create the new columns, those single strings were repeated as many times as necessary to fill every row of the data frame. We can also provide a vector that has as many elements as there are rows in the data frame—then the new column is populated with the elements of that vector, one element per row:
total_corr <- task_pre$ppn_correct * 30
print(total_corr)
## [1] 25.65914 20.42699 21.56462 10.04032 18.09691 14.89018 16.46720 17.93700
task_pre$total_correct <- total_corr
print(task_pre)
## id mean_rt ppn_correct session total_correct
## 1 P004 0.9940302 0.8553045 pre 25.65914
## 2 P007 1.0177688 0.6808997 pre 20.42699
## 3 P005 0.8747640 0.7188208 pre 21.56462
## 4 P002 0.9370050 0.3346773 pre 10.04032
## 5 P001 0.7810016 0.6032302 pre 18.09691
## 6 P008 0.8966036 0.4963394 pre 14.89018
## 7 P003 1.0374975 0.5489067 pre 16.46720
## 8 P006 1.0125920 0.5978999 pre 17.93700
Sometimes your data is spread across multiple spreadsheets (no pun intended). For example, demographic data is stored in one spreadsheet and task scores in another. Combining these spreadsheets is more complicated than just copying and pasting them next to each other: you have to make sure the rows line up properly, using the `merge` function.
To merge two data frames, you have to specify which columns contain the key that should be used to match up the rows. In this case, both the demographics and the task data have individual participants as their unit of observation, so we specify that the participant IDs should be used to line them up.
merged <- merge(demog, stacked, by = 'id')
print(merged)
## id age mean_rt ppn_correct session
## 1 P001 73 0.9761309 0.75279619 post
## 2 P001 73 0.7810016 0.60323019 pre
## 3 P001 73 1.0808253 0.29404723 mid
## 4 P002 73 0.9370050 0.33467732 pre
## 5 P002 73 1.0827661 0.73563547 mid
## 6 P002 73 1.0295283 0.74715848 post
## 7 P003 77 1.0374975 0.54890673 pre
## 8 P003 77 0.8050702 0.73247260 mid
## 9 P003 77 0.8089268 0.76735500 post
## 10 P004 71 0.9940302 0.85530451 pre
## 11 P004 71 0.8351172 0.84887690 post
## 12 P004 71 1.1213398 0.98107790 mid
## 13 P005 74 0.8747640 0.71882082 pre
## 14 P005 74 0.8617424 0.26777398 mid
## 15 P005 74 0.7655857 0.14172428 post
## 16 P006 71 0.7467665 0.27127975 mid
## 17 P006 71 0.9018639 0.66236845 post
## 18 P006 71 1.0125920 0.59789987 pre
## 19 P007 85 1.0177688 0.68089972 pre
## 20 P007 85 0.8357853 0.51299325 post
## 21 P007 85 0.9053562 0.06180917 mid
## 22 P008 74 0.8966036 0.49633942 pre
## 23 P008 74 0.9261200 0.77565487 mid
## 24 P008 74 0.7821590 0.51639406 post
Note that the merging was possible even though `stacked` has more rows than `demog`: each row of `demog` was simply matched to every row of `stacked` with the same `id`.

Note also that if you merge data frames that share a column name, and you aren’t using that column name to merge them, the resulting data frame will have 2 copies of that column, called `<column-name>.x` and `<column-name>.y`:
demog$is_pilot <- FALSE
task_pre$is_pilot <- FALSE
merged_2 <- merge(demog, task_pre, by = 'id')
print(merged_2)
## id age is_pilot.x mean_rt ppn_correct session total_correct is_pilot.y
## 1 P001 73 FALSE 0.7810016 0.6032302 pre 18.09691 FALSE
## 2 P002 73 FALSE 0.9370050 0.3346773 pre 10.04032 FALSE
## 3 P003 77 FALSE 1.0374975 0.5489067 pre 16.46720 FALSE
## 4 P004 71 FALSE 0.9940302 0.8553045 pre 25.65914 FALSE
## 5 P005 74 FALSE 0.8747640 0.7188208 pre 21.56462 FALSE
## 6 P006 71 FALSE 1.0125920 0.5978999 pre 17.93700 FALSE
## 7 P007 85 FALSE 1.0177688 0.6808997 pre 20.42699 FALSE
## 8 P008 74 FALSE 0.8966036 0.4963394 pre 14.89018 FALSE
To avoid this, you could either use the redundant column to merge (`by = c('id', 'is_pilot')`) or delete it from one of the data frames.
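For example, either of these sketches would work (the names `merged_3`, `merged_4`, and `task_pre_2` are just ones I’ve picked for illustration):

# Option 1: use the shared column as part of the merge key
merged_3 <- merge(demog, task_pre, by = c('id', 'is_pilot'))

# Option 2: drop the shared column from a copy of one data frame
task_pre_2 <- task_pre
task_pre_2$is_pilot <- NULL
merged_4 <- merge(demog, task_pre_2, by = 'id')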
You will often want to average over observations in your dataset. The base R function for this is called `aggregate`, although there are also the popular `group_by` and `summarize` functions provided by the `dplyr` library.
by_age <- aggregate(mean_rt ~ age, data = merged, FUN = median)
print(by_age)
## age mean_rt
## 1 71 0.9479471
## 2 73 1.0028296
## 3 74 0.8682532
## 4 77 0.8089268
## 5 85 0.9053562
by_session_and_age <- aggregate(ppn_correct ~ session + age, data = merged, FUN = mean)
print(by_session_and_age)
## session age ppn_correct
## 1 mid 71 0.62617883
## 2 post 71 0.75562268
## 3 pre 71 0.72660219
## 4 mid 73 0.51484135
## 5 post 73 0.74997733
## 6 pre 73 0.46895376
## 7 mid 74 0.52171442
## 8 post 74 0.32905917
## 9 pre 74 0.60758012
## 10 mid 77 0.73247260
## 11 post 77 0.76735500
## 12 pre 77 0.54890673
## 13 mid 85 0.06180917
## 14 post 85 0.51299325
## 15 pre 85 0.68089972
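For reference, a rough `dplyr` equivalent of that last call would look something like this (not run here; it assumes you have the dplyr library installed):

library(dplyr)
merged %>%
  group_by(session, age) %>%
  summarize(ppn_correct = mean(ppn_correct))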
Data can be described as (relatively) wide or long. In long data, the units of observation are relatively granular. For example, in our stacked dataset the unit of analysis is the experimental session for a particular participant. As a result, long data contains relatively more rows. In wide data, the units of observation are relatively abstract or superordinate. For example, we might have a data frame where each row is a participant and the task scores for the pre- and post-sessions are stored in separate columns. As a result, wide data contains relatively more columns.
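Here’s a tiny made-up illustration of the same data in the two shapes:

# Long: one row per participant per session
long_eg <- data.frame(id = c('P1', 'P1', 'P2', 'P2'),
                      session = c('pre', 'post', 'pre', 'post'),
                      score = c(10, 12, 9, 14))

# Wide: one row per participant
wide_eg <- data.frame(id = c('P1', 'P2'),
                      score.pre = c(10, 9),
                      score.post = c(12, 14))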
We can move between wide and long format using the `reshape` function in base R (although, as with `aggregate`, the tidyverse offers replacements: `pivot_wider` and `pivot_longer` from the `tidyr` library).
To make a data frame wider, we need to specify which column differentiates observations at the current level of observation (the `timevar`, in this case `session`) and which columns identify the intended, more abstract level of observation (the `idvar`, in this case `id` and `age`). All variables not in `idvar` or `timevar` are assumed to be measurements made at the current, more granular, level of observation.
wide <- reshape(merged, direction = 'wide',
timevar = 'session',
idvar = c('id', 'age')
)
print(names(wide))
## [1] "id" "age" "mean_rt.post" "ppn_correct.post"
## [5] "mean_rt.pre" "ppn_correct.pre" "mean_rt.mid" "ppn_correct.mid"
Note that the term is “time var” because repeated observations (longitudinal data) are a paradigmatic case of long data, but the `reshape` function could also be used to differentiate observations from, e.g., different experimental conditions.
Wide data is often useful for computing difference scores:
diff_scores <- wide$mean_rt.post - wide$mean_rt.pre
hist(diff_scores)
t.test(diff_scores)
##
## One Sample t-test
##
## data: diff_scores
## t = -1.5029, df = 7, p-value = 0.1766
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.19820158 0.04416014
## sample estimates:
## mean of x
## -0.07702072
Reshaping from wide to long is less simple, because it’s based on reading values from column names (e.g., `pre` from `mean_rt.pre`). You have to provide the names of all the columns containing measurements that vary by the new (more granular) level of observation (`varying`), along with the name of the new level of observation (`timevar` again).
Sometimes the column names in `varying` can be correctly parsed by R automatically, but to be safe you can also provide the specific names of the variables that were measured multiple times (`v.names`) along with the possible values that `timevar` can take on (`times`):
long <- reshape(wide, direction = 'long',
varying = c('mean_rt.pre',
'mean_rt.mid',
'mean_rt.post',
'ppn_correct.pre',
'ppn_correct.mid',
'ppn_correct.post'),
timevar = 'session',
v.names = c('mean_rt', 'ppn_correct'),
times = c('pre', 'mid', 'post'))
My general advice would be to use long format for long-term storage and reshape to wide as needed for analysis. This helps you avoid headaches like the above.
In general, code should not be repetitive. If you find yourself copying and pasting the same piece of code and making small changes, this is a sign that you should use a loop instead. Loops allow you to iterate over a sequence of values and apply the same set of operations to each of them. That sounds very general because it is! Loops are very generally applicable. Let’s make it more concrete:
sequence_of_vals <- c(1, 2, 4, 8, 16)
for (current_val in sequence_of_vals) {
# First operation: increment
plus_one <- current_val + 1
# Second operation: print
print(plus_one)
}
## [1] 2
## [1] 3
## [1] 5
## [1] 9
## [1] 17
Here are some common research scenarios where you might use loops:

1. Reading in and combining data files from many participants or sessions
2. Applying the same computation to each of several columns or subsets of a data frame
Some people might argue that you should use the `apply` family of functions for these tasks (particularly #2), but I often find it easiest to just write a loop.
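For example, the loop above could be written as a one-liner using `sapply`:

sapply(sequence_of_vals, function(x) x + 1)
## [1]  2  3  5  9 17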
In the last part of the workshop, we’ll do some live coding to compile data files from multiple participants and sessions. But first…
Suppose, for whatever reason, we want to re-stack the session-specific task data frames. Let’s scroll up to the stacking section and re-run the code there.
Why would we get an error running code that worked before?
It’s because we altered the `task_pre` table in the section on merging (`task_pre$total_correct <- total_corr`). When we go back and re-run the old stacking code, it uses the altered `task_pre` data frame.
This is a mistake that’s very easy to make when you’re using R like Excel. Scrolling up to re-run old code creates the illusion that you’re turning back the clock, but in fact there is a single, current “environment” of variables in R (see the panel to the right) that determines what data your code is referring to at the time you run it. When you run old code in a new environment, all bets are off.
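Here’s a minimal sketch of the problem, with made-up variables:

x <- 10
y <- x * 2
print(y)
## [1] 20
x <- 0
# Scrolling up and re-running `y <- x * 2` now makes y 0, not 20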
A spooky cautionary story:
You’re using R like Excel one day and you get an exciting result. Your supervisor isn’t around at the moment but you make a mental note to show them and you drop the command to generate the result into your script. You go on doing analysis until your supervisor walks by. You scroll up and re-run the command, but now it returns an error—probably an old-code-new-environment error. This means you have to re-run your whole script, but it’s 1000 lines long so that will be slow. Your supervisor tells you to call them back when you’re ready. You re-run the script, but your exciting result is still nowhere to be found—it must have been the result of an old-code-new-environment error too 😱
One further topic to explore on your own: operator precedence (order of operations, especially for Booleans).