Convert all Character variables to Factors
Introduction
First, let’s load up what we need!
set.seed(15102020)
library(tidyverse) #We'll use tidyverse functions
library(magrittr) #A few extra pipes from magrittr
library(lexicon) #For a word dictionary
When dealing with complex datasets, a variable is often stored as a character variable when what you really want is a factor variable. On the surface, these two constructs look very similar:
eg_df <- tibble(
c_var = c("Cat","Dog","Cat","Mouse","Mouse"),
f_var = factor(c("Cat","Dog","Cat","Mouse","Mouse"))
)
eg_df
## # A tibble: 5 x 2
## c_var f_var
## <chr> <fct>
## 1 Cat Cat
## 2 Dog Dog
## 3 Cat Cat
## 4 Mouse Mouse
## 5 Mouse Mouse
However, underneath they are treated quite differently. Behind the scenes, the factors are actually stored as integers with a special lookup table called their levels, which can be seen if we print the variable individually:
eg_df$f_var
## [1] Cat Dog Cat Mouse Mouse
## Levels: Cat Dog Mouse
We can also see the hidden numbers by converting this to numeric:
as.numeric(eg_df$f_var)
## [1] 1 2 1 3 3
The first element, Cat, is associated with the first level, so it is stored as a 1. The third element is also Cat, so it is also stored as a 1. The fourth & fifth are both Mouse, and so they’re stored as 3, indicating to use the third level.
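To make the lookup idea concrete, here is a minimal sketch in base R (the names f and rebuilt are only for illustration). It shows that the stored codes, used as an index into the levels, give back the original labels:

```r
# A small factor, matching the example above
f <- factor(c("Cat", "Dog", "Cat", "Mouse", "Mouse"))

typeof(f)       # "integer": the underlying storage
levels(f)       # the lookup table: "Cat" "Dog" "Mouse"
as.integer(f)   # the stored codes: 1 2 1 3 3

# Indexing the levels by the codes rebuilds the original labels
rebuilt <- levels(f)[as.integer(f)]
identical(rebuilt, as.character(f))   # TRUE
```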
Why factors?
Most statistical operations within R that can act on a character variable will essentially convert to a factor first. So, it’s more efficient to convert characters to factors before passing them into these kinds of functions. This also gives us more control over what we’re going to get.
This conversion makes many processes that work with characters a bit slow. If you want to run 20 functions on a dataset and each one has to convert your characters to factors internally before doing its work, it’s clearly faster to convert them yourself once, up front.
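As for the extra control, a small sketch with base R's table() may help (the vector animals and the chosen level order are just assumptions for the example). When the input is left as a character, the ordering of the groups is decided for us; when we build the factor ourselves, we pick it:

```r
animals <- c("Cat", "Dog", "Cat", "Mouse", "Mouse")

# Left as characters, table() converts internally and sorts the groups
names(table(animals))        # "Cat" "Dog" "Mouse"

# Converted by hand, we choose the order of the levels ourselves
animals_f <- factor(animals, levels = c("Mouse", "Dog", "Cat"))
names(table(animals_f))      # "Mouse" "Dog" "Cat"
```

The same idea carries over to modelling functions, where the first level is typically treated as the reference category.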
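Factors also take up less space in your system’s memory. In 64-bit R, a factor stores a 4-byte integer per element where a character vector stores an 8-byte pointer, so a factor needs approximately half the space. The saving is modest, though, because the way R stores character data is already surprisingly efficient: each distinct string is only kept once. Converting is still a good habit to get into, especially if you ever move on to tools with less efficient storage methods.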
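If you want to check the memory claim on your own machine, a rough sketch using object.size() is below. The vector length and the three labels are arbitrary choices, and the exact numbers will depend on your platform and R build, so no output is shown:

```r
set.seed(1)

# A long character vector with only a few distinct values
x_chr <- sample(c("Cat", "Dog", "Mouse"), 1e5, replace = TRUE)
x_fct <- factor(x_chr)

# Size of each object in memory
object.size(x_chr)
object.size(x_fct)

# Size of the factor relative to the character vector
size_ratio <- as.numeric(object.size(x_fct)) / as.numeric(object.size(x_chr))
size_ratio
```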
Converting
Above, I used the factor() function to quickly convert a single character variable to a factor variable. But what if you’ve got a large dataset with many, many character variables that you want to convert to factors? What’s the smoothest way to do this?
Example random dataset
First, let’s create a large dataset by looping through a bunch of columns. We’ll use Fry’s 1000 Most Commonly Used English Words, as found in the sw_fry_1000 dataset from the {lexicon} package, to choose random words for each variable. We’ll also throw in some numeric variables to make things harder:
df <- tibble(id=1:1000) #declare a tibble with just an id variable
for(i in 1:10)
{
  #How many distinct words should this variable have?
  distinct_words <- round(rexp(1,1/20)) +1
  #What words can we choose from for this variable?
  these_words <- sample(sw_fry_1000,distinct_words)
  #What's the name of this variable?
  this_name <- paste0("var_",ncol(df) + 1)
  #Generate the variable
  this_variable <- sample(these_words,1000,replace=TRUE)
  #Store it in the tibble
  df[[this_name]] <- this_variable
  #Approximately 1/3 of the time, we'll add a numeric variable
  if(rbinom(1,1,1/3) == 1){
    this_name <- paste0("var_",ncol(df)+1)
    df[[this_name]] <- rnorm(1000)
  }
}
df
## # A tibble: 1,000 x 14
## id var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12
## <int> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 1 prop~ 1.83 row gove~ gene~ cow else women length 0.249 home
## 2 2 four -0.225 wind~ reas~ speak cow squa~ gold exerc~ 0.688 numer~
## 3 3 leave 0.367 gold plant came cow egg human exerc~ -0.517 tell
## 4 4 rock 0.919 that meat gene~ cow leave human skill -0.280 fill
## 5 5 favor -1.01 mile nine tree cow very hand has -0.0302 left
## 6 6 shop 1.14 hunt drink speak cow take meat hit 0.908 over
## 7 7 end 0.0427 engi~ seas~ gene~ cow art women exerc~ 0.0395 unit
## 8 8 favor -0.647 body drink gene~ cow diff~ doll~ most -0.458 people
## 9 9 earth -2.47 fight nine tree cow deci~ air king 0.0182 child
## 10 10 end 1.35 prot~ drink speak cow carry women grand -0.978 conti~
## # ... with 990 more rows, and 2 more variables: var_13 <chr>, var_14 <dbl>
The generation of this data is actually rather clunky as it relies on a loop, and we’re going to avoid that for the conversion. Instead, we’re going to turn all these characters into factors in a single line. Here’s the line of code which will update the dataset, followed by the explanation.
The solution
With {tidyverse} processes, the key thing we’re trying to do is build a “sentence” explaining what we’re doing. Here’s our expression, followed by the English sentence equivalent:
df %<>% mutate(across(where(is.character),as_factor))
#Update the df by mutating it across variables where it is a
# character by performing as_factor on them
df
## # A tibble: 1,000 x 14
## id var_2 var_3 var_4 var_5 var_6 var_7 var_8 var_9 var_10 var_11 var_12
## <int> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <fct>
## 1 1 prop~ 1.83 row gove~ gene~ cow else women length 0.249 home
## 2 2 four -0.225 wind~ reas~ speak cow squa~ gold exerc~ 0.688 numer~
## 3 3 leave 0.367 gold plant came cow egg human exerc~ -0.517 tell
## 4 4 rock 0.919 that meat gene~ cow leave human skill -0.280 fill
## 5 5 favor -1.01 mile nine tree cow very hand has -0.0302 left
## 6 6 shop 1.14 hunt drink speak cow take meat hit 0.908 over
## 7 7 end 0.0427 engi~ seas~ gene~ cow art women exerc~ 0.0395 unit
## 8 8 favor -0.647 body drink gene~ cow diff~ doll~ most -0.458 people
## 9 9 earth -2.47 fight nine tree cow deci~ air king 0.0182 child
## 10 10 end 1.35 prot~ drink speak cow carry women grand -0.978 conti~
## # ... with 990 more rows, and 2 more variables: var_13 <fct>, var_14 <dbl>
And as if by magic, all of the characters are now factors (note the <fct> under the variable names).
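One detail worth checking on a small example is how the levels are ordered. The sketch below uses a made-up tibble called small (not part of the dataset above) to confirm that only the character column changes type, and to compare as_factor() from {forcats}, which keeps levels in order of first appearance, with base factor(), which sorts them:

```r
library(tidyverse)

# A tiny, hypothetical dataset with one character column
small <- tibble(
  id     = 1:4,
  animal = c("Mouse", "Cat", "Dog", "Cat"),
  weight = c(0.02, 4.1, 12.5, 3.8)
)

converted <- small %>% mutate(across(where(is.character), as_factor))

# Only the character column has changed type
sapply(converted, class)

# as_factor() keeps levels in order of first appearance,
# whereas base factor() sorts them
levels(converted$animal)       # "Mouse" "Cat" "Dog"
levels(factor(small$animal))   # "Cat" "Dog" "Mouse"
```

If you would rather have sorted levels, passing factor instead of as_factor to across() should work in the same way.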
The Explanation
The above code uses five functions and one operator to perform the action. We’ll dig down into the functions and then climb back out as their results are processed:
%<>% grabs the tibble on its left-hand side and passes it to the function on the right. At this point, it works exactly like the regular %>% operator.
mutate() means we are creating or updating a variable inside the tibble.
across() allows us to perform a function across many variables within the tibble.
where() allows us to specify where we want across() to perform the function.
is.character(): in the above line, we don’t use the brackets for is.character() because we’re not applying it, we’re referencing it. We’re telling the where() function to use this when checking where we want the function to be applied. The is.character() function returns TRUE when the variable is a character and FALSE when it isn’t (e.g. a numeric).
where() therefore applies this function to every variable in df and, in effect, returns a vector of TRUE and FALSE to across() to indicate which variables in the tibble we want across() to act on.
as_factor() converts things (e.g. characters) into factors.
across() has now been passed a logical vector telling it which columns to apply a function to, and a function that it needs to apply. So it does just that and outputs another tibble.
mutate() has then been passed a tibble for its first argument (df via the %<>% pipe) and another tibble as the output of across(). It stitches these together; if there are any names in common, it overwrites those in df with those from across(). All the variables in across() will also appear in df because that’s where they came from, so the old values are overwritten with the new ones.
%<>% then receives this new tibble from mutate() and stores it back into the df tibble that we originally passed to it. This is essentially saying that df %<>% f() is the same as df <- df %>% f(), which is why this is called the assignment pipe or updating pipe.
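The steps above can be unpacked on a small example. This is only a sketch: the tibble toy and the names explicit and piped are made up for illustration, and sapply() is used as a rough stand-in for what where() arranges behind the scenes rather than what it literally does:

```r
library(tidyverse)
library(magrittr)

# A tiny, hypothetical dataset
toy <- tibble(
  id     = 1:3,
  animal = c("Cat", "Dog", "Cat"),
  colour = c("black", "brown", "white"),
  weight = c(4.1, 12.5, 3.8)
)

# is.character is referenced, not called: sapply() calls it on each column
is_chr <- sapply(toy, is.character)
is_chr   # id FALSE, animal TRUE, colour TRUE, weight FALSE

# Explicit assignment with the regular pipe...
explicit <- toy %>% mutate(across(where(is.character), as_factor))

# ...and the same update written with the assignment pipe
piped <- toy
piped %<>% mutate(across(where(is.character), as_factor))

identical(explicit, piped)   # TRUE
```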