The lapply() family
This question on Reddit got me thinking about the lapply()
family of functions, and how a beginner might want to learn about them. Here is my take.
Introduction
The easiest one to understand is lapply()
, so I’ll work through that and then extend to the others. As an aside, the programming term for this is vectorising, as it allows us to perform an action over an entire vector (or list, in R) at once.
Ignoring the dots, lapply()
takes two arguments X
and FUN
. FUN
is the name of the function, and X
is a list of objects. When I say list, this could be an actual list, as created by the list()
function, or a vector such as 1:10
. But if you try to put something more complicated in like a data.frame()
, you can get unexpected results (I’ll come back to this).
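As a quick sketch of the point about plain vectors (the squaring function is just something made up for illustration), lapply() will take 1:3 and treat each number as an element, but it still hands back a list:

```r
# X here is an ordinary integer vector rather than something built with list()
squares <- lapply(1:3, function(i) i^2)

is.list(squares)  # the result is a list even though the input was not
squares[[2]]      # extract the second result
```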
lists
So, let’s say we have
X <- list(1:10,11:20,21:30)
X
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 11 12 13 14 15 16 17 18 19 20
##
## [[3]]
## [1] 21 22 23 24 25 26 27 28 29 30
This list has three elements, and each element consists of a vector of 10 numbers. We can access them using [[
, where X[[1]]
will return the first element, the numbers 1 to 10:
X[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
X[[2]]
will return the second element, etc… This is extraction as it extracts an element from a list. Extraction can only bring out a single element. We can also subset using [
for example
X[1:2]
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] 11 12 13 14 15 16 17 18 19 20
This returns the first and second elements. X[1]
will return a subset consisting of the first element.
X[1]
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
What’s the difference between X[1]
and X[[1]]
? Well, X[1]
returns a list, which is just 1 element long, that element being a vector of the numbers from 1 to 10. X[[1]]
returns the actual element at position 1.
length(X[1])
## [1] 1
length(X[[1]])
## [1] 10
class(X[1])
## [1] "list"
class(X[[1]])
## [1] "integer"
So X[1]
is a list, just like X
but is shorter, a subset, just like how X[1:2]
is a subset with length 2. Whereas X[[1]]
is the first element of X
. This is clearer if we try to add something to these two objects:
X[[1]] + 3
## [1] 4 5 6 7 8 9 10 11 12 13
X[1] + 3
## Error in X[1] + 3: non-numeric argument to binary operator
Again, to stress the point. X[1]
is not a number, it is a list containing a single element. Since X[1]
is a list, we can therefore extract that first element from it:
X[1][[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
The difference between a list and a vector is that the elements of a vector all have to be of the same type (e.g. all characters as in c("a","b","c")
or all numbers as in c(1,2,3)
). The c()
function will coerce them otherwise, so c(1,"2",3)
will coerce to characters. But the elements of a list can all be different, so list("hello",2,1:10)
is a valid list with three elements. In fact, lists can contain lists (nested lists):
Y <- list("hello",1:10,list("one","two","three"))
Y
## [[1]]
## [1] "hello"
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[3]]
## [[3]][[1]]
## [1] "one"
##
## [[3]][[2]]
## [1] "two"
##
## [[3]][[3]]
## [1] "three"
Y has three elements. If you extract the third element,
Y[[3]]
## [[1]]
## [1] "one"
##
## [[2]]
## [1] "two"
##
## [[3]]
## [1] "three"
you get another list. If you subset the third element,
Y[3]
## [[1]]
## [[1]][[1]]
## [1] "one"
##
## [[1]][[2]]
## [1] "two"
##
## [[1]][[3]]
## [1] "three"
you get a list with 1 element.
As far as nomenclature is concerned, R actually treats a list as a type of vector (a generic vector), whereas an atomic vector such as 1:10 has the requirement that all entries be of the same type. You can even use extraction on an atomic vector,
x <- 1:10
x[3]
## [1] 3
x[[3]]
## [1] 3
Although, when dealing with a vector, the second version is much less common. The reason this works is that, for a single position, extraction and subsetting are essentially the same thing in a vector: the result is always a vector, it just might be of length 1.
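To make the coercion point from earlier in this section concrete, here is a minimal sketch (the object names v and l are my own, purely for illustration) comparing what c() and list() do with the same mixed inputs:

```r
v <- c(1, "2", 3)     # c() coerces everything to a single type
l <- list(1, "2", 3)  # list() keeps each element as it was given

class(v)                         # one class for the whole vector
element_types <- lapply(l, class) # one class per list element
unlist(element_types)
```

The vector ends up as all characters, whereas each element of the list keeps its own type.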
lapply()
So, now that we know what a list is, we can look at what lapply()
does to that list. If we supply a function, lapply()
will run that function on every element in that list. The simplest example would be, using X
from above, lapply(X,mean)
will return a list with the mean()
of every element in X
.
lapply(X,mean)
## [[1]]
## [1] 5.5
##
## [[2]]
## [1] 15.5
##
## [[3]]
## [1] 25.5
Remember that the elements in X
are the vectors of numbers, 1:10
, 11:20
and 21:30
. We’ve applied the function to the list, hence the name: a list-apply.
The function doesn’t have to be one that is named, and we can supply a function in-line
lapply(X, function(x) mean(x-5.5))
## [[1]]
## [1] 0
##
## [[2]]
## [1] 10
##
## [[3]]
## [1] 20
This applies the function function(x) mean(x-5.5)
to every element in X
. You could define this function outside of the lapply()
function earlier, but there is no need if this is the only place we plan on using it.
From R 4.1
onwards, this is even easier with the shorthand \()
syntax:
lapply(X, \(x) mean(x - 5.5))
So running lapply(X,FUN)
is the same as running the following for()
loop
output <- vector("list",length(X))
for(i in seq_along(X)){
  output[[i]] <- FUN(X[[i]])
}
Compare the previous code to this:
output <- vector("list",length(X))
for(i in seq_along(X)){
  output[[i]] <- mean(X[[i]])
}
output
## [[1]]
## [1] 5.5
##
## [[2]]
## [1] 15.5
##
## [[3]]
## [1] 25.5
Notice that I’ve defined the output <- vector("list",length(X))
before running the for()
loop. This line basically makes an empty list of the defined length. This will come up when we move on from lapply().
dots
One part of lapply()
that I’ve ignored is the ...
dots argument. These are basically other arguments that you want passed on to your function. Whatever is in the dots will be passed to every call to FUN
, whether named or not:
lapply(c("a","b","c"),paste,"2")
## [[1]]
## [1] "a 2"
##
## [[2]]
## [1] "b 2"
##
## [[3]]
## [1] "c 2"
lapply(list( 1:10, c(1,2,NA,4), 21:30), mean, na.rm=TRUE)
## [[1]]
## [1] 5.5
##
## [[2]]
## [1] 2.333333
##
## [[3]]
## [1] 25.5
Essentially, this runs the following loop:
X <- list( 1:10, c(1,2,NA,4), 21:30)
output <- vector("list",3)
for(i in 1:3){
  output[[i]] <- mean(X[[i]],na.rm=TRUE)
}
output
## [[1]]
## [1] 5.5
##
## [[2]]
## [1] 2.333333
##
## [[3]]
## [1] 25.5
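Putting the loop and the dots together, here is a rough sketch of what a hand-rolled version might look like. To be clear, my_lapply() is a hypothetical name of my own and a simplification for illustration; it is not how R implements lapply() internally.

```r
# my_lapply() is a hypothetical, simplified stand-in for lapply()
my_lapply <- function(X, FUN, ...) {
  output <- vector("list", length(X))  # pre-allocate an empty list
  for (i in seq_along(X)) {
    # wrap in list() so that a NULL result is stored rather than dropped
    output[i] <- list(FUN(X[[i]], ...))
  }
  names(output) <- names(X)            # carry over any names from the input
  output
}

X <- list(1:10, c(1, 2, NA, 4), 21:30)
mine <- my_lapply(X, mean, na.rm = TRUE)
base <- lapply(X, mean, na.rm = TRUE)
identical(mine, base)
```

Everything after FUN in the call is collected by the dots and forwarded unchanged to each call of FUN.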
Hopefully that will be enough to understand lapply()
. One unusual case is using lapply()
on a data.frame
-like structure. Now, a data.frame
looks like a table, but it’s actually a list, which can be counter-intuitive: each element of the list is a column in the data.frame
. So, if you run the following, you get a result that is only 4 elements long, one per column:
iris0 <- iris[,1:4]
lapply(iris0,mean)
## $Sepal.Length
## [1] 5.843333
##
## $Sepal.Width
## [1] 3.057333
##
## $Petal.Length
## [1] 3.758
##
## $Petal.Width
## [1] 1.199333
You might think that this would work across the rows of the data.frame
, but it works down the columns. Also note that these outputs are now also named the same as the input list. This can be useful for keeping track of your inputs and outputs.
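If you want to convince yourself that a data.frame really is a list of columns, a small check along these lines should do it (col_means is just a name I’ve picked for the result):

```r
iris0 <- iris[, 1:4]

is.list(iris0)   # a data.frame is a list underneath
length(iris0)    # its length is the number of columns, not rows

col_means <- lapply(iris0, mean)
names(col_means)                 # same names as the columns
mean(iris0[["Sepal.Length"]])    # matches the first element of col_means
```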
apply()
This brings us onto apply()
.
The apply()
function does a similar job; however, rather than working on the elements of a list, it works on multi-dimensional objects such as matrices and arrays. It tries to collapse a multi-dimensional object down by one (or more) of its dimensions. So it turns a matrix into a vector (or an array into a smaller array). As well as X
(which must be multi-dimensional, so not a plain list) and FUN
, it also takes MARGIN
which tells apply()
which dimension(s) to collapse:
M <- matrix(1:9,nrow=3)
apply(M,1,mean) #takes the mean of each row
## [1] 4 5 6
apply(M,2,mean) #takes the mean of each column
## [1] 2 5 8
The type of the result depends on what FUN returns rather than being fixed, and once again apply()
can take other arguments as dots. So this also works quite well with character matrices:
M <- matrix(letters[1:9],nrow=3)
apply(M,1,paste0,collapse="") #pastes across the rows
## [1] "adg" "beh" "cfi"
apply(M,2,paste0,collapse="") #pastes down the columns
## [1] "abc" "def" "ghi"
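The matrix examples collapse two dimensions down to one. As a sketch of the "array into a smaller array" case mentioned above (the array A and its dimensions are made up for illustration), MARGIN can be a vector of dimensions to keep:

```r
A <- array(1:24, dim = c(2, 3, 4))  # a 2 x 3 x 4 array

# keep dimensions 1 and 2, and sum over the third
collapsed <- apply(A, c(1, 2), sum)
dim(collapsed)
collapsed[1, 1]  # A[1,1,1] + A[1,1,2] + A[1,1,3] + A[1,1,4]
```

Here the result is a 2 x 3 matrix, with each cell holding the sum of the four values stacked behind it.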
This means we can use apply()
on a data.frame
to work across the rows, rather than down the columns. In this case, even though a data.frame
is a list, apply() converts it to a matrix first (using as.matrix()), so it still works:
apply(iris0,1,mean)
## [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500
## [13] 2.325 2.125 2.800 3.000 2.750 2.575 2.875 2.675 2.675 2.675 2.350 2.650
## [25] 2.575 2.450 2.600 2.600 2.550 2.425 2.425 2.675 2.725 2.825 2.425 2.400
## [37] 2.625 2.500 2.225 2.550 2.525 2.100 2.275 2.675 2.800 2.375 2.675 2.350
## [49] 2.675 2.475 4.075 3.900 4.100 3.275 3.850 3.575 3.975 2.900 3.850 3.300
## [61] 2.875 3.650 3.300 3.775 3.350 3.900 3.650 3.400 3.600 3.275 3.925 3.550
## [73] 3.800 3.700 3.725 3.850 3.950 4.100 3.725 3.200 3.200 3.150 3.400 3.850
## [85] 3.600 3.875 4.000 3.575 3.500 3.325 3.425 3.775 3.400 2.900 3.450 3.525
## [97] 3.525 3.675 2.925 3.475 4.525 3.875 4.525 4.150 4.375 4.825 3.400 4.575
## [109] 4.200 4.850 4.200 4.075 4.350 3.800 4.025 4.300 4.200 5.100 4.875 3.675
## [121] 4.525 3.825 4.800 3.925 4.450 4.550 3.900 3.950 4.225 4.400 4.550 5.025
## [133] 4.250 3.925 3.925 4.775 4.425 4.200 3.900 4.375 4.450 4.350 3.875 4.550
## [145] 4.550 4.300 3.925 4.175 4.325 3.950
This now gives a vector of the averages of each row.
Rest of the family
Now for lapply()
’s sisters:
vapply()
takes an extra argument, FUN.VALUE, which is a template of the same type and length as what you want each result to be. This is the one that I use most often. You can think of it like an lapply()
that will output something other than a list. I usually give FUN.VALUE
as something like integer(1)
or character(1)
. These functions generate vectors of that type with the given length; they are wrappers around things like vector("integer",1)
X <- list(1:10,11:20,21:30)
vapply(X,mean,numeric(1))
## [1] 5.5 15.5 25.5
This time, we get a numeric vector, rather than a list like we would with lapply()
. I find this much easier to ensure I’m working with the correct type of data.
sapply()
tries to simplify your output. So if lapply()
outputs a list of vectors that are all the same length, instead of a list, it’ll return a matrix:
X <- list(1:5, 6:10, 11:15)
sapply(X,range)
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 5 10 15
Each column in this result is the same as one of the elements of the list lapply(X,range)
. They’ve just been cbind
’d together. The use of sapply()
is often discouraged in scripts and packages, as the output type can be inconsistent. vapply()
is much preferred as it gives more control over the output. The above can be replicated with vapply()
and will throw an error if the output is unexpected:
X <- list(1:5, 6:10, 11:15)
vapply(X,range,numeric(2))
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 5 10 15
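Here is a small sketch of that inconsistency, and of vapply() complaining when the output is not what was promised. The helper keep_big() is a hypothetical function of my own, chosen only because the length of its result depends on the data:

```r
# keep_big() is a made-up helper: it keeps only the values greater than 2
keep_big <- function(x) x[x > 2]

same_lengths  <- sapply(list(3:5, 4:6), keep_big)  # every result has length 3
mixed_lengths <- sapply(list(1:3, 4:6), keep_big)  # results have length 1 and 3

is.matrix(same_lengths)  # simplified to a matrix
is.list(mixed_lengths)   # silently left as a list

# vapply() refuses to continue when a result doesn't match FUN.VALUE
X <- list(1:5, 6:10, 11:15)
result <- tryCatch(
  vapply(X, range, numeric(1)),  # range() returns 2 values, not 1
  error = function(e) "vapply() stopped with an error"
)
result
```

The same sapply() call gives a matrix or a list depending on the data, which is exactly the sort of thing that bites later in a script.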
tapply()
is more complicated as it subsets the X
based on the INDEX
. The documentation describes this as a “Ragged Array”. I have never used this directly, as I will usually do the subsetting manually using split()
, but that is essentially what tapply()
does behind the scenes. tapply()
also comes with a simplify
argument, which decides whether R will try and simplify the results, like in sapply()
or not. By default, it will try to apply this simplification. The following are therefore (roughly) equivalent:
tapply(X, INDEX, FUN, simplify=FALSE)
lapply(split(X,INDEX), FUN)
split()
creates a list in which the vector given as its first argument is split into groups based on the second argument.
So we can compare using both an lapply()
and a vapply():
x <- 1:10
grp <- c(1,1,1,2,2,3,3,3,4,5)
tapply(x,grp,sum,simplify=FALSE)
## $`1`
## [1] 6
##
## $`2`
## [1] 9
##
## $`3`
## [1] 21
##
## $`4`
## [1] 9
##
## $`5`
## [1] 10
lapply(split(x,grp),sum)
## $`1`
## [1] 6
##
## $`2`
## [1] 9
##
## $`3`
## [1] 21
##
## $`4`
## [1] 9
##
## $`5`
## [1] 10
tapply(x,grp,sum)
## 1 2 3 4 5
## 6 9 21 9 10
vapply(split(x,grp),sum,numeric(1))
## 1 2 3 4 5
## 6 9 21 9 10
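For a slightly more realistic sketch of the same comparison, here is the mean sepal length per species in the built-in iris data, done both ways (the object names are my own):

```r
# grouped means with tapply()
by_tapply <- tapply(iris$Sepal.Length, iris$Species, mean)

# the same thing done manually: split into groups, then apply
groups   <- split(iris$Sepal.Length, iris$Species)
by_split <- vapply(groups, mean, numeric(1))

names(groups)  # one element per species
by_tapply
by_split
```

The two results hold the same numbers, although tapply() returns a one-dimensional array whereas vapply() returns a plain named vector, which is part of why I said "roughly" equivalent.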
The other member of the lapply()
family is mapply()
. This is even more powerful as it allows you to vectorise over multiple arguments, rather than just the first. Syntactically, the difference here is that the dots are the vectorised arguments, and the non-vectorised arguments go into the MoreArgs
argument.
X <- list("one","two",c("three", "four"))
Y <- list("A","B",c("C","D"))
mapply(paste,X,Y)
## [[1]]
## [1] "one A"
##
## [[2]]
## [1] "two B"
##
## [[3]]
## [1] "three C" "four D"
This is the same as doing:
list(
paste(X[[1]],Y[[1]]),
paste(X[[2]],Y[[2]]),
paste(X[[3]],Y[[3]])
)
## [[1]]
## [1] "one A"
##
## [[2]]
## [1] "two B"
##
## [[3]]
## [1] "three C" "four D"
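The examples above don’t use MoreArgs, so here is a minimal sketch of it. The inputs are made up for illustration; the idea is that sep should be the same for every call, so it goes into MoreArgs rather than the dots:

```r
X <- c("one", "two", "three")
Y <- c("A", "B", "C")

# X and Y are vectorised over; sep = "-" is passed unchanged to every call
joined <- mapply(paste, X, Y, MoreArgs = list(sep = "-"))
unname(joined)
```

This is the same as calling paste("one", "A", sep = "-"), paste("two", "B", sep = "-") and so on, then collecting the results.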
Here is one final example using rep()
, which repeats the first argument a specific number of times
X <- letters[1:4]
Y <- 1:4
mapply(rep,X,Y)
## $a
## [1] "a"
##
## $b
## [1] "b" "b"
##
## $c
## [1] "c" "c" "c"
##
## $d
## [1] "d" "d" "d" "d"