So you want to: merge your data sets

Why you want to do it:

Often times we are interested in exploring connections among variables from different sources. Merging your data files can be a way of collecting all the variables of interest so you can explore research questions and hypotheses about the data.

In addition, merging can be used to label your observations. We go through examples below.

What merging is:

Merging joins columns from two or more data frames based on matching row information across the data frames.

What merging isn’t:

Merging isn’t about joining data frames by rows. Instead row binding is what you need if you have two data frames with the exact same variables (columns) that you want to join in one data frame. For this, try the rbind() function in base R.

How you can do it:

There are a number of different ways of merging files in R. For these examples, let’s consider two data frames:

# The first data frame (Dat1) contains information on 
## ID: organism ID
## Day: day of sampling
## Length: length at sampling

str(Dat1)

'data.frame':   40 obs. of  3 variables:
 $ ID    : chr  "id29" "id30" "id22" "id31" ...
 $ Day   : num  1 2 1 2 1 1 2 1 2 2 ...
 $ Length: num  150 164 110 134 155 125 144 144 139 167 ...

# The second data frame (Dat2) contains information on 
## ID: organism ID
## Day: day of sampling
## Temp: temperature at sampling

str(Dat2)

'data.frame':   40 obs. of  3 variables:
 $ ID  : chr  "id23" "id20" "id25" "id29" ...
 $ Day : num  2 2 2 1 1 1 2 1 2 1 ...
 $ Temp: num  7.1 7.2 6.4 7.4 6.3 7.2 7 8.1 7.1 6.1 ...

in the base package using `merge()`

mDat<-merge(Dat1, # data frame to merge
             Dat2, # other data frame to merge
             by = c("ID", "Day")) # merge by variable(s)

str(mDat) # take a look at the object I made

'data.frame':   40 obs. of  4 variables:
 $ ID    : chr  "id20" "id20" "id21" "id21" ...
 $ Day   : num  1 2 1 2 1 2 1 2 1 2 ...
 $ Length: num  195 139 144 180 110 97 149 184 139 143 ...
 $ Temp  : num  6.1 7.2 8.7 5.3 6.9 7.5 7.2 7.1 8.1 7.1 ...

Note that you don’t need to have the same number of observations (rows) in your two data frames to merge. Merging can be a great way of labelling your data. Here’s an example:

Consider a data frame with strain information for each of your organism IDs:

## ID: organism ID
## Strain: strain of organism

str(Dat3)

'data.frame':   20 obs. of  2 variables:
 $ ID    : chr  "id24" "id30" "id33" "id26" ...
 $ Strain: chr  "A" "C" "B" "C" ...

Note that Dat1 and Dat2 each contained 40 observations - one observation for each of 20 IDs made on each of 2 days.

In contrast Dat3 only has 20 observations - information about the strain for each of 20 IDs.

By using merge(), we can add the strain information to mDat:

allDat <- merge(mDat, # one data frame
                Dat3, # the other data frame
                by = "ID") # variables to merge by

str(allDat)

'data.frame':   40 obs. of  5 variables:
 $ ID    : chr  "id20" "id20" "id21" "id21" ...
 $ Day   : num  1 2 1 2 1 2 1 2 1 2 ...
 $ Length: num  195 139 144 180 110 97 149 184 139 143 ...
 $ Temp  : num  6.1 7.2 8.7 5.3 6.9 7.5 7.2 7.1 8.1 7.1 ...
 $ Strain: chr  "C" "C" "A" "A" ...

Now we have our observations labelled by the strain information!

Some things to note:

if you leave out the by = function totally, R will look for column names that are similar between the two data frames and use that for the merge.
you can designate that the “merge by” variables have different column names in the two data frames. This is done with the by.x = and by.y = arguments. Check ?merge for more.
you can control what happens to unmatched columns (e.g. if an ID appeared in only one of the two data frames). This is done with the all =, all.x =, and all.y = arguments. Check ?merge for more.

in the dplyr package using `full_join()`

The dplyr package includes the full_join() function as another way to merge your data frames

library(dplyr) # load dplyr package


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

mDat<-full_join(Dat1, # data frame to merge
                Dat2, # other data frame to merge
                by = join_by(ID, Day)) # merge by column

str(mDat) # take a look at the object I made

'data.frame':   40 obs. of  4 variables:
 $ ID    : chr  "id29" "id30" "id22" "id31" ...
 $ Day   : num  1 2 1 2 1 1 2 1 2 2 ...
 $ Length: num  150 164 110 134 155 125 144 144 139 167 ...
 $ Temp  : num  7.4 7.3 6.9 7.5 5.6 7.8 7.8 6.3 7.9 7.1 ...