So you want to: explore your data

Why you want to do it:

A first step in data analysis is exploring your data. This first look lets you check what the data contain and identify any errors.

How you can do it:

You can run through a series of checks to have a first look at your data. Here is TData as an example data set:

Does it exist?

The first thing to do when we read in data is to check that it was read correctly. One of the simplest checks is to see whether it exists in our workspace (this goes for any object, not only imported data). We can see what’s in our workspace with ls() and get a little more information with ls.str().
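As a minimal sketch of this check, the code below creates a small made-up data frame called demoData (a hypothetical name, standing in for imported data) and then looks for it in the workspace. The exists() function is not mentioned above; it is a base R function that tests for a single object by name.

```r
# Hypothetical object standing in for imported data
demoData <- data.frame(Station = c("S27", "HL2"), Temp = c(6.18, -1.16))

ls()                 # names of all objects in the workspace
exists("demoData")   # TRUE if an object with this name can be found
ls.str()             # object names plus a short description of each
```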

How big is it?

The dim() function gives the dimensions (number of rows and number of columns) of your data frame:

dim(TData)
[1] 184   4

You can also get the number of elements in a vector with the length() function:

V3 <- c(3.2, 0.9, 2.34, 5.4) # a vector containing numeric decimal values

length(V3)
[1] 4
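If you only need one of the two dimensions, base R also has nrow() and ncol(). The sketch below uses a made-up data frame, demoData (hypothetical, not the TData example), and also shows what length() returns for a data frame.

```r
# Hypothetical data frame with 3 rows and 2 columns
demoData <- data.frame(Year = c(2000, 2001, 2002), Temp = c(6.2, 5.9, 7.1))

dim(demoData)     # rows, then columns: 3 2
nrow(demoData)    # number of rows: 3
ncol(demoData)    # number of columns: 2
length(demoData)  # for a data frame, length() counts columns: 2
```

Because length() counts columns rather than rows when given a data frame, nrow() is usually the safer choice for counting observations.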

Viewing your data object

Typing the object name will make R try to print the entire object to your screen, but there are some useful functions for looking at an object without having to see everything.

Try head() and tail() to give you a look at the first and last (respectively) few rows of the data frame:

head(TData)
  Station Year Month      Temp
1     S27 2000    11  6.184429
2     HL2 2004     3 -1.155200
3     HL2 2000     2  1.524000
4     S27 2005     9 12.681000
5     HL2 2005     6  6.208400
6     S27 2003     8 11.737000
tail(TData)
    Station Year Month      Temp
179     S27 2003    12  3.923000
180     HL2 2005     3  0.189400
181     HL2 2007    10 13.511505
182     S27 2002     6  5.227500
183     S27 2005     3 -1.014667
184     HL2 2007    11  9.963482

By default, head() and tail() display 6 rows, but you can change this by providing a second argument (n = ...), e.g.:

head(TData, n = 3) # Show first 3 rows of TData
  Station Year Month      Temp
1     S27 2000    11  6.184429
2     HL2 2004     3 -1.155200
3     HL2 2000     2  1.524000

You can also get a quick look at your data by asking for the summary and/or structure.

The summary() function gives you a summary of each column:

summary(TData)
   Station               Year          Month             Temp       
 Length:184         Min.   :1999   Min.   : 1.000   Min.   :-1.551  
 Class :character   1st Qu.:2001   1st Qu.: 3.000   1st Qu.: 1.014  
 Mode  :character   Median :2003   Median : 6.000   Median : 5.167  
                    Mean   :2003   Mean   : 6.457   Mean   : 5.811  
                    3rd Qu.:2005   3rd Qu.: 9.250   3rd Qu.:10.028  
                    Max.   :2007   Max.   :12.000   Max.   :16.472  

The str() function describes the data types in each column:

str(TData)
'data.frame':   184 obs. of  4 variables:
 $ Station: chr  "S27" "HL2" "HL2" "S27" ...
 $ Year   : int  2000 2004 2000 2005 2005 2003 2006 2003 2005 2001 ...
 $ Month  : int  11 3 2 9 6 8 8 4 4 6 ...
 $ Temp   : num  6.18 -1.16 1.52 12.68 6.21 ...
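Since one aim of this first look is to spot errors, a few extra checks can follow summary() and str(). The sketch below is not part of the TData example: it uses a small made-up data frame, demoT (hypothetical), with a deliberately missing temperature and an impossible month, to show what such checks might look like.

```r
# Hypothetical data with two deliberate problems
demoT <- data.frame(
  Station = c("S27", "HL2", "HL2", "S27"),
  Month   = c(11, 3, 13, 9),           # 13 is not a valid month
  Temp    = c(6.18, -1.16, NA, 12.68)  # NA marks a missing value
)

colSums(is.na(demoT))   # number of missing values in each column
range(demoT$Month)      # smallest and largest month: 3 13
unique(demoT$Station)   # distinct station names
table(demoT$Station)    # number of rows per station
```

A month of 13 or an unexpected station name in output like this is a hint that something went wrong when the data were entered or read in.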

Viewing part of your data object

Viewing a specific element of a vector or data frame

An element is one data point. You can view an element by using the element’s position in the object. You do this using the object name and square brackets [ ].

Since a vector is a one-dimensional object (see also the section on Importing), you will only need to give the position of the element.

For example, to get the 3rd element of vector V3:

V3
[1] 3.20 0.90 2.34 5.40

you would use:

V3[3]
[1] 2.34
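The same square brackets accept more than one position at a time. As a short sketch using the V3 vector defined earlier:

```r
V3 <- c(3.2, 0.9, 2.34, 5.4)

V3[c(1, 3)]  # 1st and 3rd elements: 3.20 2.34
V3[2:4]      # 2nd through 4th elements: 0.90 2.34 5.40
V3[-1]       # everything except the 1st element: 0.90 2.34 5.40
```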

Since a data frame is a 2-dimensional object, you need to give two positions representing the element’s row and column. You do this again with the square brackets [ ] as well as a comma , separating the row and column numbers.

For example, with the data frame DF1:

print(DF1)
  Year Month Abundance
1 2021   Jan        43
2 2021   Feb        24
3 2021   Mar         5

The Abundance value 24 is in row #2 and column #3, so you access it with:

DF1[2,3]
[1] 24

Note that you can also access the whole 2nd row with:

DF1[2,]
  Year Month Abundance
2 2021   Feb        24

and the whole 3rd column with:

DF1[,3]
[1] 43 24  5

In data frames, you can also refer to a column by its name, using the $ symbol. For example, you can get the 3rd (Abundance) column with:

DF1$Abundance
[1] 43 24  5

and the Abundance value 24 is located with:

DF1$Abundance[2]
[1] 24
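Column names can also be used inside the square brackets. The sketch below rebuilds DF1 from the values printed above (the construction code and column types are assumptions, since the original code that created DF1 is not shown) and accesses the same value in a few equivalent ways.

```r
# Rebuild DF1 from the values printed above (column types assumed)
DF1 <- data.frame(
  Year      = c(2021, 2021, 2021),
  Month     = c("Jan", "Feb", "Mar"),
  Abundance = c(43, 24, 5)
)

DF1[2, "Abundance"]                # row 2 of the Abundance column: 24
DF1[["Abundance"]][2]              # same value via double brackets: 24
DF1[2:3, c("Month", "Abundance")]  # rows 2 to 3 of two named columns
```

Using names rather than positions tends to be more robust, because the code keeps working if the column order changes.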

A conditional view of your data

Sometimes you do not know the location (row and column) of the data you want to find. Instead you have a condition for the data you want, e.g. which observations have a plant height greater than 10 cm?

Let’s look at some example data, stored in the data frame myDat, to answer this question:

   PlotID PlantHeight
1       F         8.3
2       F         6.7
3       B        10.2
4       D         8.2
5       D        11.4
6       F         6.8
7       F        10.7
8       C        10.4
9       F         9.8
10      F        11.2

There are two ways of finding the observations with plant heights greater than 10 cm. The first is to locate the row numbers where the condition is true (plant heights are greater than 10 cm). You can do this with the which() function.

inds <- which(myDat$PlantHeight > 10) # the row numbers that fulfill the condition

myDat[inds,] # rows meeting the condition
   PlotID PlantHeight
3       B        10.2
5       D        11.4
7       F        10.7
8       C        10.4
10      F        11.2
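To see what which() is doing, it can help to look at the condition on its own. The sketch below rebuilds myDat from the values printed above (the construction code is an assumption, since the original is not shown) and stores the condition in a vector named isTall (a hypothetical name).

```r
# Rebuild myDat from the values printed above
myDat <- data.frame(
  PlotID      = c("F", "F", "B", "D", "D", "F", "F", "C", "F", "F"),
  PlantHeight = c(8.3, 6.7, 10.2, 8.2, 11.4, 6.8, 10.7, 10.4, 9.8, 11.2)
)

isTall <- myDat$PlantHeight > 10  # one TRUE or FALSE per row
which(isTall)                     # positions of the TRUE values: 3 5 7 8 10
sum(isTall)                       # number of rows meeting the condition: 5
myDat[isTall, ]                   # a logical vector can also select rows
```

With complete data the logical vector and which() select the same rows. If PlantHeight contained missing values (NA), which() would skip them, whereas indexing with the logical vector directly would return extra rows filled with NA.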

Another way to find observations with plant heights greater than 10 cm is to use the subset() function. This subsets the data directly, returning only the observations that meet the condition.

subset(myDat, PlantHeight > 10)
   PlotID PlantHeight
3       B        10.2
5       D        11.4
7       F        10.7
8       C        10.4
10      F        11.2

Useful Logical Functions (operators):

These functions help you write conditional statements. Note that they do not look like “regular” functions in R; they are sometimes called operators instead.

Operator   Description
>          greater than
>=         greater than or equal to
<          less than
<=         less than or equal to
==         exactly equal to
!=         not equal to
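Each operator compares element by element and returns TRUE or FALSE for every value. A short sketch with a made-up vector, heights (hypothetical):

```r
heights <- c(8.3, 10.2, 11.4, 6.8)  # made-up plant heights

heights > 10    # FALSE  TRUE  TRUE FALSE
heights <= 8.3  #  TRUE FALSE FALSE  TRUE
heights == 6.8  # FALSE FALSE FALSE  TRUE
heights != 6.8  #  TRUE  TRUE  TRUE FALSE
```

Note that == (two equals signs) tests for equality, while a single = assigns a value.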

Note that you can include more than one condition by joining them with either & (representing AND) or | (representing OR).

For example, you might want to find observations with plant heights greater than 10 cm that are also from Plot F. This can be done with:

subset(myDat, PlantHeight > 10 & PlotID == "F")
   PlotID PlantHeight
7       F        10.7
10      F        11.2

As another example, you might want to find observations with either large plants (greater than 10 cm) or small plants (less than 7 cm). You can do this by finding observations that meet one condition OR the other:

subset(myDat, PlantHeight > 10 | PlantHeight < 7)
   PlotID PlantHeight
2       F         6.7
3       B        10.2
5       D        11.4
6       F         6.8
7       F        10.7
8       C        10.4
10      F        11.2
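The same combined conditions also work with square brackets. The sketch below again rebuilds myDat from the values printed earlier (the construction code is an assumption); tallInF and extreme are hypothetical names for the two conditions.

```r
# Rebuild myDat from the values printed earlier
myDat <- data.frame(
  PlotID      = c("F", "F", "B", "D", "D", "F", "F", "C", "F", "F"),
  PlantHeight = c(8.3, 6.7, 10.2, 8.2, 11.4, 6.8, 10.7, 10.4, 9.8, 11.2)
)

# AND: both conditions must be TRUE for a row to be kept
tallInF <- myDat$PlantHeight > 10 & myDat$PlotID == "F"
which(tallInF)    # 7 10

# OR: at least one condition must be TRUE
extreme <- myDat$PlantHeight > 10 | myDat$PlantHeight < 7
which(extreme)    # 2 3 5 6 7 8 10

myDat[tallInF, ]  # same rows as the subset() call with &
```

Unlike subset(), square brackets do not look up column names inside the data frame for you, so each column needs the myDat$ prefix.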

Another helpful function is %in%. This function checks for membership. For example, you might want to know which observations are in Plot B, C or D:

subset(myDat, PlotID %in% c("B", "C", "D"))
  PlotID PlantHeight
3      B        10.2
4      D         8.2
5      D        11.4
8      C        10.4

Note that this would be the same as using multiple OR (|) operators:

subset(myDat, PlotID == "B" | PlotID == "C" | PlotID == "D")
  PlotID PlantHeight
3      B        10.2
4      D         8.2
5      D        11.4
8      C        10.4
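The equivalence of the two approaches can be checked directly. This sketch uses only the PlotID values from the example data, stored in a vector named plots (a hypothetical name), and compares the logical vectors produced by each approach.

```r
# PlotID values from the example data
plots <- c("F", "F", "B", "D", "D", "F", "F", "C", "F", "F")

viaIn <- plots %in% c("B", "C", "D")                 # membership test
viaOr <- plots == "B" | plots == "C" | plots == "D"  # chained OR

identical(viaIn, viaOr)  # TRUE: both give the same result
which(viaIn)             # 3 4 5 8
which(!viaIn)            # ! negates: rows NOT in B, C or D: 1 2 6 7 9 10
```

The %in% form stays short however many plots are listed, which makes it easier to read than a long chain of OR conditions.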
Data Skills Portfolio Program CC BY-NC-SA 4.0