Convert all Character variables to Factors

Introduction

First, let’s load up what we need!

set.seed(15102020)
library(tidyverse) #We'll use tidyverse functions
library(magrittr) #A few extra pipes from magrittr
library(lexicon) #For a word dictionary

When dealing wth complex datasets, it is common that a variable may be stored as a character variable, when in reality what you want is a factor variable. On the surface, these two constructs look very similar:

eg_df <- tibble(
  c_var = c("Cat","Dog","Cat","Mouse","Mouse"),
  f_var = factor(c("Cat","Dog","Cat","Mouse","Mouse"))
)
eg_df
## # A tibble: 5 x 2
##   c_var f_var
##   <chr> <fct>
## 1 Cat   Cat  
## 2 Dog   Dog  
## 3 Cat   Cat  
## 4 Mouse Mouse
## 5 Mouse Mouse

However, underneath they are treated quite differently. Behind the scenes, the factors are actually stored as integers with a special lookup table called their levels, which can be seen if we print the variable individually:

eg_df$f_var
## [1] Cat   Dog   Cat   Mouse Mouse
## Levels: Cat Dog Mouse

We can also see the hidden numbers by converting this to numeric:

as.numeric(eg_df$f_var)
## [1] 1 2 1 3 3

The first element, Cat is associated with the first level, so it is stored as a 1, the third element is also Cat, so it is also stored as a 1. The fourth & fifth are both Mouse and so they’re stored as 3, indicating to use the third level.

Why factors?

Most statistical operations within R that can act on a character variable will essentially convert to a factor first. So, it’s more efficient to convert characters to factors before passing them into these kinds of functions. This also gives us more control over what we’re going to get.

This conversion makes many processes that work with characters a bit slow. If you’re wanting to do 20 functions on a data set and each one needs to convert your characters to factors internally before doing what it needs to, it’s clearly much faster to manually convert once before using these functions.

Factors also take up slightly less space in your system’s memory. In R, this is approximately half the space of a character, however the way R stores this kind of data is surprisingly efficient. It’s definitely a good habit to get into if you ever want to move onto less efficient storage methods.

Converting

Above, I used the factor() function to quickly convert a single character variable to a factor variable. But what about if you’ve got a large dataset with many, many character variables that you want to convert to factors. What’s the smoothest way to do this?

Example random dataset

First, let’s create a large dataset, we’ll loop through a bunch of columns. We’ll use Fry’s 1000 Most Commonly Use English Words, as found in the sw_fry_1000 dataset from the {lexicon} package to choose random words for each variable. We’ll also throw in some numeric variables to make things harder:

df <- tibble(id=1:1000) #declare a tibble with just an id variable
for(i in 1:10)
{
  #How many distinct words should this variable have?
  distinct_words <- round(rexp(1,1/20)) +1
  
  #What words can we choose from for this variable?
  these_words <- sample(sw_fry_1000,distinct_words)
  
  #What's the name of this variable?
  this_name <- paste0("var_",ncol(df) + 1)
  
  #Generate the variable
  this_variable <- sample(these_words,1000,replace=T)
  
  #Store it in the tibble
  df[[this_name]] <- this_variable
  
  #Approximated 1/3 of the time, we'll add a numeric variable
  if(rbinom(1,1,1/3) == 1){
    this_name <- paste0("var_",ncol(df)+1)
    
    df[[this_name]] <- rnorm(1000)
  }
  
}
df
## # A tibble: 1,000 x 14
##       id var_2   var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10  var_11 var_12
##    <int> <chr>   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>    <dbl> <chr> 
##  1     1 prop~  1.83   row   gove~ gene~ cow   else  women length  0.249  home  
##  2     2 four  -0.225  wind~ reas~ speak cow   squa~ gold  exerc~  0.688  numer~
##  3     3 leave  0.367  gold  plant came  cow   egg   human exerc~ -0.517  tell  
##  4     4 rock   0.919  that  meat  gene~ cow   leave human skill  -0.280  fill  
##  5     5 favor -1.01   mile  nine  tree  cow   very  hand  has    -0.0302 left  
##  6     6 shop   1.14   hunt  drink speak cow   take  meat  hit     0.908  over  
##  7     7 end    0.0427 engi~ seas~ gene~ cow   art   women exerc~  0.0395 unit  
##  8     8 favor -0.647  body  drink gene~ cow   diff~ doll~ most   -0.458  people
##  9     9 earth -2.47   fight nine  tree  cow   deci~ air   king    0.0182 child 
## 10    10 end    1.35   prot~ drink speak cow   carry women grand  -0.978  conti~
## # ... with 990 more rows, and 2 more variables: var_13 <chr>, var_14 <dbl>

The generation of this data is actually rather clunky as it’s using a loop, and we’re going to avoid that. Instead, we’re going to turn all these characters into factors in a single line. Here’s the line of code which will update the dataset, followed by the explanation:

The solution

With {tidyverse} processes, the key thing we’re trying to do is build a “sentence” explaining what we’re doing. Here’s our expression, followed by the English sentence equivalent

df %<>% mutate(across(where(is.character),as_factor))
#Update the df by mutating it across variables where it is a
#   character by performing as_factor on them

df
## # A tibble: 1,000 x 14
##       id var_2   var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10  var_11 var_12
##    <int> <fct>   <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct>    <dbl> <fct> 
##  1     1 prop~  1.83   row   gove~ gene~ cow   else  women length  0.249  home  
##  2     2 four  -0.225  wind~ reas~ speak cow   squa~ gold  exerc~  0.688  numer~
##  3     3 leave  0.367  gold  plant came  cow   egg   human exerc~ -0.517  tell  
##  4     4 rock   0.919  that  meat  gene~ cow   leave human skill  -0.280  fill  
##  5     5 favor -1.01   mile  nine  tree  cow   very  hand  has    -0.0302 left  
##  6     6 shop   1.14   hunt  drink speak cow   take  meat  hit     0.908  over  
##  7     7 end    0.0427 engi~ seas~ gene~ cow   art   women exerc~  0.0395 unit  
##  8     8 favor -0.647  body  drink gene~ cow   diff~ doll~ most   -0.458  people
##  9     9 earth -2.47   fight nine  tree  cow   deci~ air   king    0.0182 child 
## 10    10 end    1.35   prot~ drink speak cow   carry women grand  -0.978  conti~
## # ... with 990 more rows, and 2 more variables: var_13 <fct>, var_14 <dbl>

And as if by magic, all of the characters are now factors (note the <fct> under the variable names).

The Explanation

The above code uses five functions, and an operation to perform the action. We’ll dig down into the functions and then climb back out as their results are processed:

  • %<>% grabs the tibble on it’s left hand side and passes it to the function on the right. At this point, it works exactly like the regular %>% operator
    • mutate() means we are creating or updating a variable inside the tibble
      • across() allows us perform a function across many variables within the tibble
        • where() allows us to specify where we want across() to perform the function
          • is.character(), in the above line, we don’t use the brackets for is.character() because we’re not applying it, we’re referencing it. We’re telling the where() function to use this when checking where we want the function to be applied. The is.character() function returned TRUE when the variable is a character and FALSE when it isn’t (e.g. a numeric)
        • where() therefore applies this function to every variable in df and returns a vector of TRUE and FALSE to across() to indicate which variables in the tibble we want across() to act on
        • as_factor() converts things (e.g. characters) into factors.
      • across() has now been passed a logical vector telling it which columns to apply a function and a function that it needs to apply. So it does just that and outputs another tibble
    • mutate() has then been passed a tibble for it’s first argument (df via the %<>% pipe) and another tibble as the output of across(). It stitches these together, if there are any names in common, it overwrites those in df with those from across(). All the variables in across() will also appear in df because that’s where they came from, so the old values are overwritten with the new ones
  • %<>% then receives this new tibble from mutate() and stores it back into the df tibble that we originally passed to it. This is essentially saying that df %<>% f() is the same as df <- df %>% f(), that’s why this is called the assignment pipe or updating pipe.
Michael Barrowman
Michael Barrowman
Data Scientist

I am a Data Scientist, PhD candidate and Educator.