So you want to: merge your data sets (and/or label your observations)
Why you want to do it:
Merging will help you when:
you want to explore data from different sources (e.g. two or more data frames), or
you want to label your observations
Merging your data files collects all the variables of interest and/or labels your observations so you can explore research questions and hypotheses about the data.
Merging data frames
What merging data frames is:
Merging joins columns from two or more data frames based on matching row information across the data frames. The variable(s) or column(s) used to match the data frames is the “key” or “by” column. In this example, the key column is ID.
What merging data frames is not:
Merging is not about joining data frames by rows. Instead row binding is what you need if you have two data frames with the exact same variables (columns) that you want to join in one data frame. For this, try the rbind() function in base R.
How you can merge data frames:
There are a number of different ways of merging files in R. For these examples, let us consider two data frames:
# The first data frame (Dat1) contains information on ## ID: organism ID## Day: day of sampling## Length: length at samplingstr(Dat1)
'data.frame': 40 obs. of 3 variables:
$ ID : chr "id29" "id30" "id22" "id31" ...
$ Day : num 1 2 1 2 1 1 2 1 2 2 ...
$ Length: num 150 164 110 134 155 125 144 144 139 167 ...
# The second data frame (Dat2) contains information on ## ID: organism ID## Day: day of sampling## Temp: temperature at samplingstr(Dat2)
'data.frame': 40 obs. of 3 variables:
$ ID : chr "id23" "id20" "id25" "id29" ...
$ Day : num 2 2 2 1 1 1 2 1 2 1 ...
$ Temp: num 7.1 7.2 6.4 7.4 6.3 7.2 7 8.1 7.1 6.1 ...
The “by” or “key” variables here will be ID and Day. These will be the columns R uses to match the rows between the two data frames. Note that the observations are not in the same order when you compare Dat1 and Dat2. This is fine, as R will look for row matches using the “by” variables to merge the files.
merging with the base package using merge()
mDat<-merge(Dat1, # data frame to merge Dat2, # other data frame to mergeby =c("ID", "Day")) # merge by variable(s)str(mDat) # take a look at the object I made
'data.frame': 40 obs. of 4 variables:
$ ID : chr "id20" "id20" "id21" "id21" ...
$ Day : num 1 2 1 2 1 2 1 2 1 2 ...
$ Length: num 195 139 144 180 110 97 149 184 139 143 ...
$ Temp : num 6.1 7.2 8.7 5.3 6.9 7.5 7.2 7.1 8.1 7.1 ...
Some things to note:
if you leave out the by = function totally, R will look for column names that are similar between the two data frames and use that for the merge.
you can designate that the “merge by” variables have different column names in the two data frames. This is done with the by.x = and by.y = arguments. Check ?merge for more.
you can control what happens to unmatched columns (e.g. if an ID appeared in only one of the two data frames). This is done with the all =, all.x =, and all.y = arguments. Check ?merge for more.
merging with the dplyr package using full_join()
The dplyr package includes the full_join() function as another way to merge your data frames
library(dplyr) # load dplyr package
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
mDat<-full_join(Dat1, # data frame to merge Dat2, # other data frame to mergeby =join_by(ID, Day)) # merge by columnstr(mDat) # take a look at the object I made
'data.frame': 40 obs. of 4 variables:
$ ID : chr "id29" "id30" "id22" "id31" ...
$ Day : num 1 2 1 2 1 1 2 1 2 2 ...
$ Length: num 150 164 110 134 155 125 144 144 139 167 ...
$ Temp : num 7.4 7.3 6.9 7.5 5.6 7.8 7.8 6.3 7.9 7.1 ...
Some things to note:
The full_join() function keeps all observations appearing in either data frame.
The left_join() function keeps all observations in the first data frame (Dat1) but you will lose any unmatched observations in the second data frame (Dat2).
The right_join() function keeps all observations in the second data frame (Dat2) but you will lose any unmatched observations in the first data frame (Dat1)
Check your merge!
You should always check your merged data frame to make sure the merge behaved as you expected. This can be done by taking a random row in the first file (Dat1) and comparing the information for that row (ID and Day) in the second file (Dat2) and the merged file (mDat). Here is an example:
k <-5# pick a random row to check. Let's choose 5! Dat1[k,] # looking at the 5th row of the first data frame
ID Day Length
20 id31 1 155
indsD2 <-which(Dat2$ID == Dat1$ID[k] & Dat2$Day == Dat1$Day[k]) # which row in Dat2 has the ID and Day from the kth row in Dat1indsD2 # row with matching data
[1] 19
Dat2[indsD2, ] # matching data
ID Day Temp
20 id31 1 5.6
indsMD <-which(mDat$ID == Dat1$ID[k] & mDat$Day == Dat1$Day[k]) # which row in mDat has the ID and Day from the kth row in Dat1indsMD # row with matching data
[1] 5
mDat[indsMD, ] # matching data
ID Day Length Temp
5 id31 1 155 5.6
## The length and temperature in the merged data (mDat) match the data in the original data frames (Dat1 & Dat2)
Labelling data with merge
You can also use merge functions to label your observations (rows) with a lookup table. A lookup table matches input variables to output variables.
Let us look at an example:
Consider a lookup table that is a data frame that maps strain information to each of your organism IDs:
## ID: organism ID## Strain: strain of organismstr(lookupTab)
'data.frame': 20 obs. of 2 variables:
$ ID : chr "id24" "id30" "id33" "id26" ...
$ Strain: chr "A" "C" "B" "C" ...
Note that mDat contains 40 observations - one observation for each of 20 IDs made on each of 2 days.
In contrast lookupTab only has 20 observations - information about the strain for each of 20 IDs. There is no Day information in lookupTab as it is only a lookup table that will allow you to add strain to your observations.
By using merge(), you can add the strain information to mDat:
allDat <-merge(mDat, # one data frame lookupTab, # the lookup tableby ="ID") # variables to merge bystr(allDat)
'data.frame': 40 obs. of 5 variables:
$ ID : chr "id20" "id20" "id21" "id21" ...
$ Day : num 2 1 2 1 1 2 2 1 2 1 ...
$ Length: num 139 195 180 144 110 97 184 149 143 139 ...
$ Temp : num 7.2 6.1 5.3 8.7 6.9 7.5 7.1 7.2 7.1 8.1 ...
$ Strain: chr "C" "C" "A" "A" ...
Now you have our observations labelled by the strain information!