Chapter 6 Preparing the data for analysis

6.1 Overview

It is rarely the case that data fresh from fieldwork, or obtained directly from an instrument, are immediately ready for analysis and visualisation. Often there are minor inconsistencies and errors that need to be cleaned out of the data. Identifying and removing these contaminants is rarely taught during undergraduate or graduate training, and as a result, many researchers spend a lot of time struggling to make their data fit for analysis and visualisation. The aim of this chapter is to investigate five key concepts for cleaning data with R. This chapter will give you the tools to identify problems with your data, fix them quickly and simply, and get your data into a suitable shape for analysis and visualisation. First we will ensure that you can navigate your data easily, then we will describe five data cleaning concepts, and finally we will explore some methods for joining tables together.

6.3 Five concepts for cleaning data

Now that we have our data imported into R, and we have some familiarity with how to access it, we can work on cleaning and tidying the data to make it ready for analysis and visualisation. Data cleaning can profoundly influence the results of a statistical analysis, so it is important to take care with these steps. I have divided this process into five common tasks that I routinely do when cleaning data for analysis (cf. Jonge and Loo 2013):

  1. Fixing names: correcting spelling, replacing spaces, etc.
  2. Converting column classes: character to numeric, extracting numbers from strings, etc.
  3. Splitting and combining columns: separating one column into two or more
  4. Reshaping: from wide to long and long to wide
  5. Dealing with missing data: removing missing values or filling them in

6.3.1 Fixing names

Fixing names refers to two tasks: putting the correct column names on a data frame, and correcting misspellings of the names of items within columns in the data frame. A common problem with column names on data frames is that the ones you expect are missing, and instead you have column names like X1, X2, X3… or V1, V2, V3… We encountered this problem in the previous chapter when we used the extract_tables() function from the tabulizer package to extract a table from a PDF. The first challenge with the output of that function is that it returns a matrix:

class(table_from_pdf[[1]])
## [1] "matrix"

So we use as_data_frame() from the dplyr package to convert this into a tibble data frame (there is an as.data.frame() function for which no package is needed, but I find that the dplyr version has more useful default settings). We need to convert the matrix into a data frame so that we can have columns with different classes; for example, the ‘Sample’ column needs to be character class so it can hold a mix of letters and numbers, and the measurement columns need to be numeric so we can compute on them. So we can convert to a data frame, and then use head() to inspect the output:

terry_table <- as_data_frame(table_from_pdf[[1]])
head(terry_table)
## # A tibble: 6 x 6
##   V1     V2        V3      V4      V5    V6          
##   <chr>  <chr>     <chr>   <chr>   <chr> <chr>       
## 1 ""     Mehlich P Olsen P Total P ""    Ring Test   
## 2 Sample (mg/kg)   (mg/kg) (mg/kg) ""    Rating (1–5)
## 3 O15-1  17.3      6.5     1762    ""    2           
## 4 10G-1  18.3      6.5     1627    ""    2           
## 5 N9-10  18.9      9.4     1496    ""    3           
## 6 10H-1  20.6      4.5     1657    ""    4

In the output we see that the actual column names are V1, V2, V3, etc., and the column names we want are in the first and second rows. If the column names were exclusively in the first row, we could assign the first row to the data frame column names, and then delete the first row. The pattern for this is:

names(my_df) <- my_df[ 1, ]      # assign first row to the column names
my_df        <- my_df[-1, ]      # delete the first row 

Or for the same result in one line, we can use the assign_colnames() function from the docxtractr package. In the example below the :: saves us from having to type library(docxtractr) to make the assign_colnames() function available to our R session:

# move the first row of the data frame to the column names
my_df <- docxtractr::assign_colnames(my_df, 1)

This is an ideal solution for this common problem where the column names are in the first row and we want to move them to the proper place. However, our example here is slightly more complex because the column names are spread across the first and second rows of the data frame. In this case, the general strategy is to create a character vector that is the result of pasting together the first and second row for each column, then proceed as above and move the first row to the column names. In the code below we take the first and second rows of the data frame (terry_table[c(1:2), ]), and use the paste() function to combine them into a single character string for each column. The collapse = " " argument to the paste() function indicates that we want to collapse the two items into one item where they are separated by a space, so that “Total P” (row 1, col 4) and “(mg/kg)” (row 2, col 4) become one item: “Total P (mg/kg)”. To automate this process across each column of the ‘terry_table’ data frame, we use the map_chr() function, which also converts the output to a character vector (compare to the generic map() which returns a list):

# extract and combine first two rows
terry_table_col_names <- 
  map_chr(terry_table[c(1:2), ], 
          ~paste(.x, collapse = " "))

# delete first two rows from the table
terry_table <- terry_table[-c(1:2), ]

# view output 
terry_table_col_names
##                       V1                       V2                       V3 
##                " Sample"      "Mehlich P (mg/kg)"        "Olsen P (mg/kg)" 
##                       V4                       V5                       V6 
##        "Total P (mg/kg)"                      " " "Ring Test Rating (1–5)"

The output is quite good: we have combined the first and second row of each column to get a meaningful set of column names. But a few problems remain: there is a space before the S in ‘Sample’ in the first item of the resulting character vector, the fifth item is only a space, with no text, and the sixth item does not have the correct symbols. The leading space in the first item is a nuisance and can easily be removed with the function trimws(), which trims white space from the start and end of a character string (but not the internal spaces):

terry_table_col_names <- 
  trimws(terry_table_col_names)

# see the result
terry_table_col_names
##                       V1                       V2                       V3 
##                 "Sample"      "Mehlich P (mg/kg)"        "Olsen P (mg/kg)" 
##                       V4                       V5                       V6 
##        "Total P (mg/kg)"                       "" "Ring Test Rating (1–5)"

The empty fifth element is due to the extract_tables() function guessing that there was a fifth column in this table. However, we saw in our earlier inspections that the fifth column contains no values, so it can safely be deleted from the table, and the fifth element of the names vector can also be deleted:

# delete the fifth column 
terry_table <- terry_table[ , -5]

# delete the fifth element of the names vector
terry_table_col_names <- terry_table_col_names[-5]

The only issue remaining now is the incorrect reading of the characters in the name of the last column. These errors are likely due to subtle differences between the encoding of numbers and letters in the PDF and the types of encoding that R can easily handle. Encoding is a complex topic relating to the rules a computer follows when it stores human-readable characters as zeros and ones. In any case, the simplest fix is to directly update that item by replacing it with what we can see in the PDF that we got the data from:

terry_table_col_names[5] <- "Ring Test Rating (1-6)" 

If we had a large table with many columns, and most of the column names had an encoding issue like this, we would want an automated method to deal with all the columns at once, rather than directly updating each element by hand as we did here. For example, we could use the parse_character() function from the readr package to convert the encoding to something that looks sensible:

terry_table_col_names <- 
  parse_character(terry_table_col_names, 
                  locale = locale(encoding = "UTF-8"))

The final step here is to assign the character vector of column names to the column names of the data frame:

names(terry_table) <- terry_table_col_names

# inspect the result:
head(terry_table)
## # A tibble: 6 x 5
##   Sample `Mehlich P (mg/… `Olsen P (mg/kg… `Total P (mg/kg…
##   <chr>  <chr>            <chr>            <chr>           
## 1 O15-1  17.3             6.5              1762            
## 2 10G-1  18.3             6.5              1627            
## 3 N9-10  18.9             9.4              1496            
## 4 10H-1  20.6             4.5              1657            
## 5 10D-4  22.0             7.0              1143            
## 6 N9-12  23.0             10.0             1103            
## # … with 1 more variable: `Ring Test Rating (1-6)` <chr>

That completes the process of fixing the column names for this table, which is a typical set of operations for cleaning a data frame to prepare it for analysis. We still need to convert the column classes for some of the columns from character to numeric, but we will do that in a later section.

The second important task relating to fixing names is correcting data entry errors in the values in a column. Suppose we have this simple table of artefacts by colour, compiled by a group of undergraduate students:

# make a simple data frame
artefacts_by_colour <- 
  dplyr::data_frame(colour = c("red", " green", "greenish", "green-like", "bleu", "blue", "Red "),
                    mass_kg = c(3, 5, 8, 4, 2, 1, 7))
# have a look
artefacts_by_colour
## # A tibble: 7 x 2
##   colour     mass_kg
##   <chr>        <dbl>
## 1 red              3
## 2 " green"         5
## 3 greenish         8
## 4 green-like       4
## 5 bleu             2
## 6 blue             1
## 7 "Red "           7

At a quick glance at the ‘colour’ column we can see that ‘blue’ is mis-spelled as ‘bleu’, we might want to combine the variants of ‘green’, and ‘red’ appears in the last row with a capital ‘R’ while all the other colour names are in lower case. In a small table like this it is easy to browse the whole column, but in larger tables we can run unique(artefacts_by_colour$colour) to get a character vector of the unique values in the ‘colour’ column, or table(artefacts_by_colour$colour), which returns a table showing each unique value and how many times it occurs in the column. However you discover the problems in a column, the good news is that these are very typical issues that make raw data dirty, and we can easily clean them in R.
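For example, a quick look at the unique values in the column reveals all of the problem entries at once:

unique(artefacts_by_colour$colour)
## [1] "red"        " green"     "greenish"   "green-like" "bleu"      
## [6] "blue"       "Red "

First, we will fix the case so that all the items in the ‘colour’ column are lower case: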

artefacts_by_colour$colour <- tolower(artefacts_by_colour$colour)

# inspect the result
artefacts_by_colour
## # A tibble: 7 x 2
##   colour     mass_kg
##   <chr>        <dbl>
## 1 red              3
## 2 " green"         5
## 3 greenish         8
## 4 green-like       4
## 5 bleu             2
## 6 blue             1
## 7 "red "           7

That has fixed the ‘red’ in the last row, and now we will fix the spelling mistake using the if_else() function. This function is from the dplyr package; there is also an ifelse() in base R, but I prefer the dplyr version because it is stricter, more predictable, and faster:

library(dplyr)

artefacts_by_colour$colour <- 
  with(artefacts_by_colour, 
       if_else(colour == "bleu", 
              "blue", 
              colour))

# inspect the result
artefacts_by_colour
## # A tibble: 7 x 2
##   colour     mass_kg
##   <chr>        <dbl>
## 1 red              3
## 2 " green"         5
## 3 greenish         8
## 4 green-like       4
## 5 blue             2
## 6 blue             1
## 7 "red "           7

The if_else() function is very useful for data cleaning because we can use it to easily update values in a column. It works by evaluating a condition for each item in a vector, one-by-one, and returning a new vector with values that depend on how each item is evaluated. In this case the condition is colour == "bleu", which we can translate as ‘is the value of the column colour equivalent to “bleu”?’. For each item in the ‘colour’ column, the if_else() function will evaluate that condition and return either TRUE (the value is equivalent to “bleu”) or FALSE (the value is not “bleu”, but something else, like “red” or “green”). The second argument to if_else() is the value to return if the condition is TRUE. In our example, we can translate this as ‘if the value of the column colour is “bleu”, then return the value “blue”’. Or more plainly, ‘where “bleu” occurs in the colour column, replace it with “blue”’, akin to a find-and-replace task you might do in a word processing document. The last argument to if_else() is the value to return if the condition is FALSE. In our example, this value is whatever the value of the ‘colour’ column is. So when if_else() gets to the last item in the ‘colour’ column, it sees the value “red”, evaluates the condition ‘is “red” equivalent to “bleu”?’ as FALSE, skips over the second argument (what to return if TRUE), looks at the third argument (what to return if FALSE), sees the column name ‘colour’, and returns the last value of that column, which is “red”. Effectively, it leaves the column values unchanged wherever the condition is FALSE.

Note the use of with() in the example above; it saves me from having to type the name of the data frame twice. Without with() I would have typed it like this, repeating the name of the data frame each time I refer to the ‘colour’ column:

artefacts_by_colour$colour <- 
       if_else(artefacts_by_colour$colour == "bleu", 
              "blue", 
               artefacts_by_colour$colour)
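To see this element-by-element behaviour in isolation, here is a minimal sketch on a small made-up character vector:

x <- c("bleu", "red", "bleu", "green")
if_else(x == "bleu", "blue", x)
## [1] "blue"  "red"   "blue"  "green"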

We could use a pair of if_else() functions, one nested inside the other, to change ‘greenish’ and ‘green-like’, but instead we will use a simpler option. We can use the function gsub(), which stands for ‘global substitution’, and works by searching through each element in a vector for a match to a pattern that we supply, and then, when it finds a match to that pattern, substitutes the match with a replacement that we supply. Below, we supply the pattern to match as "green.*", which translates as ‘the word green, followed by any character (indicated by the period), any number of times (indicated by the asterisk)’. In this context, the period and asterisk are metacharacters that have special meanings instead of their usual literal meanings. A pattern like this, built with metacharacters, is called a ‘regular expression’. Working with regular expressions gives you access to a powerful, flexible, and efficient system for cleaning and manipulating data (see Fitzgerald (2012) and Friedl (2002) for detailed introductions). Regular expressions are used for processing strings in many programming languages, and are notorious for their terse and cryptic appearance. However, learning to work with regular expressions is a worthwhile investment because they can save a lot of time when cleaning data. The code below shows how we use gsub() with a regular expression to replace “greenish” and “green-like” with “green” in the ‘colour’ column:

artefacts_by_colour$colour <- 
  gsub("green.*", "green", artefacts_by_colour$colour)

# inspect the result
artefacts_by_colour
## # A tibble: 7 x 2
##   colour   mass_kg
##   <chr>      <dbl>
## 1 red            3
## 2 " green"       5
## 3 green          8
## 4 green          4
## 5 blue           2
## 6 blue           1
## 7 "red "         7

In the above example we show the use of the period and asterisk as metacharacters in a regular expression. If we wanted to be more selective, we could use another regular expression character, the ‘or’ symbol |, to indicate that we only want to substitute for a specific set of matches:

artefacts_by_colour$colour <- 
  gsub("greenish|green-like", "green", artefacts_by_colour$colour)

We can group characters used in regular expressions into four categories according to their function: converters, quantifiers, positioners, and operators. Converters are perhaps the most confusing of these, because it is not always obvious what they do and how many of them to use. Table 6.1 demonstrates this challenge, where \\ seems to have two opposite functions. When working with \\ to convert characters in a regular expression, a useful rule of thumb is to keep adding backslashes until the pattern works as desired.

Table 6.1: Characters commonly used for converting other characters in regular expressions
character meaning example
\\ convert a normal character into a metacharacter "\\s" matches any whitespace character (tabs, newlines and spaces); "\\S" matches any non-whitespace character; "\\d" matches any digit; "\\D" matches any non-digit character; "\\w" matches any word character (uppercase letters, lowercase letters, digits and underscore); "\\W" matches any non-word character (note that we need to double-up the backslash in R; other languages use a single backslash in these contexts)
\\ convert a metacharacter into a normal (literal) character "\\." matches a literal period; "\\(" matches a literal left parenthesis
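A quick sketch in the console shows both directions of conversion, using grepl() to report TRUE wherever the pattern matches (the small test vectors here are made up for illustration):

grepl("d", c("dig", "."))    # a normal character: matches the letter d
## [1]  TRUE FALSE
grepl("\\d", c("dig", "9"))  # converted to a metacharacter: matches any digit
## [1] FALSE  TRUE
grepl("\\.", c("dig", "."))  # a metacharacter converted to a literal period
## [1] FALSE  TRUE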

Quantifiers specify how many repetitions of the pattern we want to match in a string. Table 6.2 describes the commonly used quantifiers in R’s regular expressions.

Table 6.2: Characters commonly used for quantifying characters to match in regular expressions
character meaning example
* zero or more repetitions of some character/expression (metacharacter) ".*" matches any combination of characters, including none
+ 1 or more repetitions of some character/expression (metacharacter) "[0-9]+" matches numbers with at least one digit, such as ‘0’, ‘90’, or ‘021442132’
? expression is optional, i.e. 0 or 1 of some character/expression (metacharacter) "[Gg]ordon( [Vv]\\.)? [Cc]hilde" matches ‘gordon childe’ and ‘Gordon V. Childe’ (two backslashes are needed to match a literal period)
{n} matches exactly n times "ap{2}lique" matches ‘applique’ but not ‘aplique’ or ‘appplique’
{n,} matches at least n times "ap{2,}lique" matches ‘applique’ and ‘appplique’, but not ‘aplique’
{n,m} interval quantifier, allows specifying the minimum and maximum number of matches: between n and m times (metacharacter) " [1-2]{1}[0-9]{3} " matches years in a sentence, i.e. a space, followed by 1 or 2, followed by three of any of the digits between 0 and 9, followed by a space; it will match ‘ 1984 ’, but not ‘ 5000 ’ or ‘ 2e ’
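A minimal sketch of the quantifiers at work, using the ‘applique’ example from Table 6.2:

grepl("ap{2}lique", c("aplique", "applique", "appplique"))   # exactly two p's
## [1] FALSE  TRUE FALSE
grepl("ap{2,}lique", c("aplique", "applique", "appplique"))  # two or more p's
## [1] FALSE  TRUE  TRUE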

Positioners, or anchors, indicate the position of a pattern to match within a string. Table 6.3 summarises some common positioners. These are especially useful for getting a match only at the start or end of a character string.

Table 6.3: Characters commonly used to match at specific positions in regular expressions
character meaning example
^ start of the line (metacharacter) "^text" matches lines that begin with ‘text’
$ end of the line (metacharacter) "text$" matches lines that end with ‘text’
\b empty string at either edge of a word; don’t confuse it with ^ and $, which mark the edges of a string "\\bpost" will match ‘post-processual’ and ‘post hole’, but not ‘imposter’
\B empty string, provided it is not at an edge of a word "\\Bpost" will match ‘imposter’ but not ‘post-processual’ or ‘post hole’
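Again, a small demonstration with the examples from Table 6.3:

x <- c("post-processual", "post hole", "imposter")
grepl("\\bpost", x)  # 'post' at the edge of a word
## [1]  TRUE  TRUE FALSE
grepl("\\Bpost", x)  # 'post' away from the edge of a word
## [1] FALSE FALSE  TRUE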

Operators are the core of a regular expression pattern, helping to define what to include in the pattern or exclude from it. Table 6.4 summarises some common operators.

Table 6.4: Operators commonly used to make regular expression patterns
character meaning example
. any character (metacharacter) "8.11" matches ‘8/11’, ‘0811’, ‘8-11’, etc.
[...] set of characters that will be accepted in the match (character class) "^[Ii]" matches both of these lines: ‘I dug at…’ and ‘i dug at…’
[0-9] a range of characters to accept in the match (character class) "[a-zA-Z]" will match any letter in upper or lower case, "[0-9]" will match any digit
[^...] when used at the beginning of a character class, “^” means “not” (metacharacter) "[^?.]$" matches any line that does not end in ‘.’ or ‘?’; it will match ‘How old is it!’ but not ‘How old is it?’
| “or”, used to combine subexpressions called alternatives (metacharacter) "^([Nn]eolithic|[Pp]al(ae|e)olithic)" matches any character string that starts with lower- or upper-case ‘Neolithic’, ‘Palaeolithic’, or ‘Paleolithic’ (an alternative spelling)
(...) defines a group as the text in the parentheses; groups are remembered and can be referred to by \1, \2, etc. "([A-Z]\\.)" matches any single capital letter followed by a literal period, anywhere in the character string (e.g. a person’s middle initial); the group can be referred to in a replacement as "\\1"
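And a brief sketch of the operators, using the alternation and grouping examples from Table 6.4:

grepl("^([Nn]eolithic|[Pp]al(ae|e)olithic)", 
      c("Neolithic pottery", "paleolithic flints", "Bronze Age beads"))
## [1]  TRUE  TRUE FALSE

# a remembered group in a replacement: keep the initial, drop the rest
sub("([A-Z]\\.).*", "\\1", "V. Gordon Childe")
## [1] "V."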

A good guide to regular expressions is Fitzgerald (2012), and for R in particular, the chapter on strings in Wickham and Grolemund (2016) is excellent (and is online here: http://r4ds.had.co.nz/strings.html). There are also many useful websites summarising common uses of regular expressions in R, such as http://stat545.com/block022_regular-expression.html, as well as sites to learn, build and test your regular expressions (e.g. http://www.regexr.com/, https://regex101.com/, and https://www.debuggex.com/). These sites are useful for learning more about regular expressions and understanding specific patterns, but it is easy to get confused by subtle differences in the way various programming languages interpret the patterns on these websites. I find that there is no substitute for experimenting in my R console with trial and error on a very small example that represents the operation I want to apply to the full data set. Table 6.5 shows the R functions that I most commonly use when working with regular expressions in R.

Table 6.5: Functions commonly used to work with regular expressions. The stringr package has many other functions in addition to these that make data cleaning easy and fast.
Task Functions
Identify match to a pattern grep(..., value = FALSE), grepl(), stringr::str_detect()
Extract match to a pattern grep(..., value = TRUE), stringr::str_extract(), stringr::str_extract_all()
Locate pattern within a string, i.e. give the start position of matched patterns stringr::str_locate(), stringr::str_locate_all()
Replace a pattern gsub(), stringr::str_replace(), stringr::str_replace_all()
Split a string using a pattern strsplit(), stringr::str_split()
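For example, the ‘identify’ and ‘extract’ pairs in the first two rows of Table 6.5 behave like this:

colours <- c("red", "green", "blue")
grep("re", colours)                 # identify: positions of the matches
## [1] 1 2
grep("re", colours, value = TRUE)   # extract: the matching elements
## [1] "red"   "green"
grepl("re", colours)                # identify: a logical for every element
## [1]  TRUE  TRUE FALSE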

Regular expressions are useful for many other data cleaning tasks besides fixing names, and we will see them pop up in other contexts. However, it can be tempting to try to use regular expressions where another approach may be better suited. For example, you might try to write a complex regular expression where a series of simpler regular expressions, or even if_else() statements, might be easier to write and understand. When you get stuck with regular expressions, take a moment to reflect on simpler options, perhaps breaking the problem into smaller pieces and tackling each piece one at a time.

To return to our small table of artefacts, we are nearly done with fixing the problems with the colour names. The data frame looks good when we view it in the console, but a final check with table(artefacts_by_colour$colour) reveals that we still have some inconsistencies in some of the colour names.

# check the data frame
artefacts_by_colour
## # A tibble: 7 x 2
##   colour   mass_kg
##   <chr>      <dbl>
## 1 red            3
## 2 " green"       5
## 3 green          8
## 4 green          4
## 5 blue           2
## 6 blue           1
## 7 "red "         7
# final check to see if anything else needs fixing
table(artefacts_by_colour$colour)
## 
##  green   blue  green    red   red  
##      1      2      2      1      1

The output of table() shows that we still have two distinct names for ‘green’ and also for ‘red’, yet we cannot see any obvious differences in these names. To take a closer look, we must print the vector in isolation:

artefacts_by_colour$colour
## [1] "red"    " green" "green"  "green"  "blue"   "blue"   "red "

Now we can see the problem: there are leading and trailing spaces, revealed by the positions of the double quotation marks. Those single spaces mean that R sees " green" as distinct from "green". An easy and fast way to remove these is with the str_trim() function from Hadley Wickham’s stringr package. After that we can use table() to confirm that the colour names are as expected:

library(stringr)
artefacts_by_colour$colour <- str_trim(artefacts_by_colour$colour)

# check that the colour names are as expected
table(artefacts_by_colour$colour)
## 
##  blue green   red 
##     2     3     2

In situations where we have a large number of columns, we can use map_df() to automatically apply functions such as str_trim() to all the columns in a data frame, for example:

map_df(artefacts_by_colour, ~str_trim(.x))
## # A tibble: 7 x 2
##   colour mass_kg
##   <chr>  <chr>  
## 1 red    3      
## 2 green  5      
## 3 green  8      
## 4 green  4      
## 5 blue   2      
## 6 blue   1      
## 7 red    7

However, this has the undesirable side-effect of coercing all columns to the character class; notice in the above output that ‘mass_kg’ is now a character column. To trim the white space on only the character columns, and leave the numeric columns untouched, we can use map_if() and specify a condition that a column must satisfy for the function to be applied to it:

map_if(artefacts_by_colour, 
       is.character, 
       ~str_trim(.x)) %>% 
  as_data_frame()
## # A tibble: 7 x 2
##   colour mass_kg
##   <chr>    <dbl>
## 1 red          3
## 2 green        5
## 3 green        8
## 4 green        4
## 5 blue         2
## 6 blue         1
## 7 red          7

The condition we apply here is is.character(), which tests whether each column is character class. If a column is character class, then the str_trim() function is applied to it. The output from map_if() is a list, so we need to add as_data_frame() to convert the list back into a data frame, ready for the next step.

6.3.2 Converting column classes

As we saw in the last example, sometimes functions convert our columns to classes that are unexpected, or at least not convenient for our analysis. This is a common issue that can cause frustration during data analysis, so it is worth taking some time to check your column classes periodically during data cleaning and analysis, and to be familiar with quick and easy methods for changing column classes. For the simplest case, a single column, we can use as.numeric() to coerce a column of numbers from character class to numeric class. Consider the previous table of artefact colours and masses, and imagine that one more artefact has been added to the data. Notice that the mass value has been mistakenly entered as ‘2.5.’ with an extra period at the end:

artefacts_by_colour <- rbind(artefacts_by_colour, c("yellow", "2.5."))

# have a look
artefacts_by_colour
## # A tibble: 8 x 2
##   colour mass_kg
##   <chr>  <chr>  
## 1 red    3      
## 2 green  5      
## 3 green  8      
## 4 green  4      
## 5 blue   2      
## 6 blue   1      
## 7 red    7      
## 8 yellow 2.5.

With this new row added, the ‘mass_kg’ column has become a character column, and our attempts to do numerical operations, like calculating the average, will fail. In a simple case like this, where we can see digits but the class is character, as.numeric() is a good choice:

artefacts_by_colour$mass_kg <- as.numeric(artefacts_by_colour$mass_kg)

# see the result
artefacts_by_colour
## # A tibble: 8 x 2
##   colour mass_kg
##   <chr>    <dbl>
## 1 red          3
## 2 green        5
## 3 green        8
## 4 green        4
## 5 blue         2
## 6 blue         1
## 7 red          7
## 8 yellow      NA

The key thing to notice here is that while the ‘mass_kg’ column has been converted to a numeric (of the type ‘double’), the last item, which was ‘2.5.’, is now NA, which stands for ‘not available’. This is a special or ‘reserved’ word in the R language that represents a missing value. It is important to be aware of missing values in your data because some common functions (such as mean()) will return only NA if there is even one missing value, unless you specifically instruct the function to remove the NA values (e.g. mean(x, na.rm = TRUE) will remove the NAs so that a mean can be calculated).
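We can see both behaviours with the ‘mass_kg’ column we have just converted:

mean(artefacts_by_colour$mass_kg)
## [1] NA

mean(artefacts_by_colour$mass_kg, na.rm = TRUE)  # remove the NA first
## [1] 4.285714

In some cases data are genuinely missing; for example, if the last item in the ‘mass_kg’ column was ‘??’, then all we know is that someone entered some data, but it’s not meaningful data about the mass of the artefact. But since we have ‘2.5.’ we can reasonably infer that the correct value is ‘2.5’. We can use a regular expression to handle this situation; for example, here is a fairly comprehensive pattern that will extract numbers (including decimal values) from a wide variety of situations in a character vector: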

library(stringr)

x <- c("2.5.", "-100.001", "1.1kg", "p.10", "10kg", "1.0p5", "1p0.5")
str_extract(x, "-?\\.?\\d+\\.*\\d*")
## [1] "2.5"      "-100.001" "1.1"      ".10"      "10"       "1.0"     
## [7] "1"

It is often helpful to annotate a complex regular expression to show the role of each character:

str_extract(x, "-?\\.?\\d+\\.*\\d*")
#                |  |  |   |  |
#                |  |  |   |  \\d* matches any digit, zero or more times
#                |  |  | \\.* matches a literal period, zero or more times
#                |  | \\d+ matches any digit one or more times
#                | \\.? matches a literal period, zero or one time, in case 
#                |      of the decimal with no leading zero
#               -? matches a negative sign, zero or one time 

This pattern captures numbers in several typical mis-entered forms, including when units are accidentally added, or stray characters appear at the beginning or end of the number. It does not do well with more ambiguous cases, such as the last two items, where there is a character in the middle of the number. Indeed, even as humans we cannot easily determine if ‘1.0p5’ is meant to be ‘1.005’ or ‘1.05’. We would need to consider the context of the data recording in detail to make a decision about how to handle values like that (e.g. Was the instrument even capable of reporting a mass to the nearest 0.005 kg? Can we round up to the nearest 0.5 kg without any substantial loss of data?). Such a complex regular expression as the previous example may cause more problems than it solves, and we may prefer a simpler, case-by-case approach for handling badly formed values. To solve only the problem of unwanted trailing characters, such as ‘2.5.’ and ‘10kg’, we can use a simpler regular expression:

x <- c("2.5.", "-100.001", "1.1kg", "p.10", "10kg", "1.0p5", "1p0.5")
str_extract(x, "\\d+\\.*\\d*")
## [1] "2.5"     "100.001" "1.1"     "10"      "10"      "1.0"     "1"

Here it is with annotations:

str_extract(x, "\\d+\\.*\\d*")
#                |   |  |
#                |   |  \\d* matches any digit, zero or more times
#                |  \\.* matches a literal period, zero or more times
#               \\d+ matches any digit one or more times

This works well for cases like ‘2.5.’ and ‘10kg’. Although it is not helpful for other types of mistakes, we might prefer to fix those with separate regular expression functions. To return to our example data frame with ‘2.5.’ in the ‘mass_kg’ column, we can now extract a number from the commonly mis-typed values, and with as.numeric() we can coerce the column to numeric class, ready for analysis:

artefacts_by_colour$mass_kg <- 
  as.numeric(str_extract(artefacts_by_colour$mass_kg, 
                         "\\d+\\.*\\d*"))

# inspect the output
artefacts_by_colour
## # A tibble: 8 x 2
##   colour mass_kg
##   <chr>    <dbl>
## 1 red        3  
## 2 green      5  
## 3 green      8  
## 4 green      4  
## 5 blue       2  
## 6 blue       1  
## 7 red        7  
## 8 yellow     2.5

In some situations, such as when you have much larger data frames, it may not be practical to attend to each mis-typed value and extract the numeric value, as we did in the example above. With a larger data set you may be more tolerant of missing values, and the main task is simply to get all the columns into the most convenient class, and deal with the NA values later. The type_convert() function from the readr package solves this problem by automatically sampling a number of rows in each column, guessing what the class is, and then coercing the column to that class. Consider the ‘terry_table’ data frame that we worked on earlier in the chapter to fix the column names. Last we looked at it, all the columns were character class, although columns 2-4 clearly contain numbers. We can use type_convert() to quickly and easily fix the column classes:

library(readr)

terry_table <- type_convert(terry_table)

# inspect the output
str(terry_table)
## Classes 'tbl_df', 'tbl' and 'data.frame':    37 obs. of  5 variables:
##  $ Sample                : chr  "O15-1" "10G-1" "N9-10" "10H-1" ...
##  $ Mehlich P (mg/kg)     : num  17.3 18.3 18.9 20.6 22 23 23 23.1 23.3 23.4 ...
##  $ Olsen P (mg/kg)       : num  6.5 6.5 9.4 4.5 7 10 17 15.4 12.8 9.2 ...
##  $ Total P (mg/kg)       : num  1762 1627 1496 1657 1143 ...
##  $ Ring Test Rating (1-6): num  2 2 3 4 3 4 4 3 4 2 ...

An especially nice detail about type_convert() is that it also automatically trims leading and trailing white space, solving the problem we saw above with " green" and "green". The type_convert() function is automatically applied when you use any of the read_*() functions from the readr package (e.g. read_csv() and read_table2()), so you can save some time and frustration by choosing one of those functions to import your data. Finally we have a table that is basically suitable for visualisation and analysis, so we should save it as a CSV ready for the next steps, and so we can deposit it in a repository where it can easily be reused by others:

write_csv(terry_table, "data/data_output/terry_table.csv")
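As a quick check, we could read the file straight back in; because read_csv() applies the same type guessing on import, the column classes survive the round trip (a sketch, assuming the file was written as above):

terry_table_check <- read_csv("data/data_output/terry_table.csv")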

6.3.3 Splitting and combining columns

If we look carefully at the first column of ‘terry_table’, we see sample identifiers that are a combination of letters and numbers, with a dash as a separator. During analysis it is often useful to have each item in this identifier in a separate column so we can do group-wise comparisons. This is a common situation when samples are labelled with a string that combines several pieces of provenance information, such as abbreviations for site, sector, square, excavation unit, etc. For example, if we want to compare all the artefacts from one square with those from another square, then it is convenient to have the square identifier in its own column. The exact combination of items in these identifiers varies from project to project, so here we will explore some flexible techniques for splitting one column into several, which will be useful in a variety of situations.

The simplest method is one we have seen in the previous chapter, separate() from the tidyr package:

library(tidyr) # load the library

terry_table_split <- 
terry_table %>%            # start with the data frame
  separate(Sample,         # the column to split
           c("a", "b"),    # new column names
           "-",            # character to split at
           convert = TRUE) # convert to numeric       

The separate() function takes a data frame as its first argument (here passed invisibly by the pipe), and then the name of the column that we want to split up, followed by the names of the new columns that will be created by the function (here simply ‘a’ and ‘b’, but these could be something like ‘site’, ‘square’, ‘unit’, etc.), then the character to split at (here it is ‘-’, but we can also supply a regular expression), and finally convert = TRUE, which ensures that our column of numbers gets the numeric class. Among the other arguments to this function, one that I use occasionally is remove =, which by default removes the input column from the output data frame.
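For example, a small sketch of keeping the original ‘Sample’ column alongside the new ones by setting remove = FALSE:

terry_table %>% 
  separate(Sample,         
           c("a", "b"),    
           "-",            
           convert = TRUE, 
           remove = FALSE) # keep the input column too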

The reverse task, combining two or more columns into one, is just as common, for example when we need to paste together abbreviations for site, square, and excavation unit to create a unique identifier for each sample. We may also want to add new columns based on calculations on other columns. The unite() function from the tidyr package is the counterpart to separate() for pasting columns together, and mutate() from the dplyr package adds new columns computed from existing ones.
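A minimal sketch using the split table from above (‘log_total_p’ is a hypothetical new column, added here with mutate() from dplyr):

terry_table_split %>% 
  unite(Sample, a, b, sep = "-") %>%            # paste 'a' and 'b' back together
  mutate(log_total_p = log(`Total P (mg/kg)`))  # a new column computed from another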

6.4 Wide and long data

In a wide data frame, each variable has its own column and each row is a case, a layout that is usually the most convenient for data entry. In a long data frame, one column holds the variable names and another holds their values, a layout that most analysis and visualisation functions expect. The gather() and spread() functions from the tidyr package convert a data frame from wide to long, and from long back to wide.
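Here is a minimal sketch of the round trip, using a small made-up table (the column names are hypothetical):

library(tidyr)

# a wide table: one row per sample, one column per measurement
wide <- dplyr::data_frame(sample  = c("A", "B"),
                          olsen_p = c(6.5, 9.4),
                          total_p = c(1762, 1496))

# wide to long: gather the measurement columns into key-value pairs
long <- gather(wide, key = "measurement", value = "value", -sample)

# and back again: spread the key-value pairs out into columns
spread(long, measurement, value)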

6.5 Dealing with missing data

Common strategies for handling missing data include removing the rows or columns that contain missing values, imputing missing values with a summary such as the column mean or with an adjacent value, using if_else() to substitute arbitrary values, and filling gaps with the previous value in the column.
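A brief sketch of these strategies on a small made-up table (the data frame and its columns are hypothetical):

library(dplyr)
library(tidyr)

df <- data_frame(phase = c("I", NA, NA, "II"),
                 mass  = c(1.2, NA, 3.4, 5.6))

# remove any rows that contain a missing value
na.omit(df)

# impute a missing numeric value with the column mean
df$mass <- if_else(is.na(df$mass), 
                   mean(df$mass, na.rm = TRUE), 
                   df$mass)

# fill missing values with the previous value in the column
df <- fill(df, phase)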

6.6 Joining data together

The dplyr package provides bind_rows() and bind_cols() for stacking two tables together by row or by column, and left_join() for merging two tables that share a key column.
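A minimal sketch with two small made-up tables:

library(dplyr)

artefacts <- data_frame(site    = c("A", "A", "B"),
                        mass_kg = c(3, 5, 8))
site_info <- data_frame(site    = c("A", "B"),
                        period  = c("Neolithic", "Bronze Age"))

# merge: add the site information to each artefact row, matching on 'site'
left_join(artefacts, site_info, by = "site")

# stack two tables with the same columns on top of each other
# bind_rows(artefacts, more_artefacts)  # 'more_artefacts' is hypothetical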

References

Fitzgerald, Michael. 2012. Introducing Regular Expressions. O’Reilly Media, Inc.

Friedl, Jeffrey E. F. 2002. Mastering Regular Expressions. O’Reilly Media, Inc.

Jonge, Edwin de, and Mark van der Loo. 2013. “An Introduction to Data Cleaning with R.” Statistics Netherlands, The Hague.

Matloff, Norman. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. Sebastopol, CA: O’Reilly. http://r4ds.had.co.nz