<- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70) inches
This lecture, as the rest of the course, is adapted from the version Stephanie C. Hicks designed and maintained in 2021 and 2022. Check the recent changes to this file through the GitHub history.
Pre-lecture materials
Read ahead
Before class, you can prepare by reading the following materials:
Acknowledgements
Material for this lecture was borrowed and adopted from
Learning objectives
At the end of this lesson you will:
- Understand how to perform vector arithmetics in R
- Implement the 5 functional loops in R (vs e.g. for loops) in R
Vectorization
Writing for
and while
loops are useful and easy to understand, but in R we rarely use them.
As you learn more R, you will realize that vectorization is preferred over for-loops since it results in shorter and clearer code.
Vector arithmetics
Rescaling a vector
In R, arithmetic operations on vectors occur element-wise. For a quick example, suppose we have height in inches:
and want to convert to centimeters.
Notice what happens when we multiply inches by 2.54:
* 2.54 inches
[1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80
In the line above, we multiplied each element by 2.54.
Similarly, if for each entry we want to compute how many inches taller or shorter than 69 inches (the average height for males), we can subtract it from every entry like this:
- 69 inches
[1] 0 -7 -3 1 1 4 -2 4 -2 1
Two vectors
If we have two vectors of the same length, and we sum them in R, they will be added entry by entry as follows:
<- 1:10
x <- 1:10
y + y x
[1] 2 4 6 8 10 12 14 16 18 20
The same holds for other mathematical operations, such as -
, *
and /
.
<- 1:10
x sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000 3.162278
<- 1:10
y * y x
[1] 1 4 9 16 25 36 49 64 81 100
That is, we donβt write a for
loop that takes each element of x
and each element of y
, adds, them up, then puts them all in a single vector.
## Check that x and y have the same length
stopifnot(length(x) == length(y))
## Create our result object
<- vector(mode = "integer", length = length(x))
result ## Loop through each element of x and y, calculate the sum,
## then store it on 'result'
for (i in seq_along(x)) {
<- x[i] + y[i]
result[i]
}## Check that we got the same answer
identical(result, x + y)
[1] TRUE
Functional loops
While for
loops are perfectly valid, when you use vectorization in an element-wise fashion, there is no need for for
loops because we can apply what are called functional loops.
Functional loops are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. Here are a list of them in base R:
lapply()
: Loop over a list and evaluate a function on each elementsapply()
: Same aslapply
but try to simplify the resultapply()
: Apply a function over the margins of an arraytapply()
: Apply a function over subsets of a vectormapply()
: Multivariate version oflapply
(wonβt cover)
An auxiliary function split()
is also useful, particularly in conjunction with lapply()
.
Define a function
<- function(a, b) {
my_sum + b
a
}
## Same but with an extra check to make sure that 'a' and 'b'
## have the same lengths.
<- function(a, b) {
my_sum ## Check that a and b are of the same length
stopifnot(length(a) == length(b))
+ b
a }
Apply your function
Here since we have two input vectors, we need to use mapply()
which is one of the more complex functions. Note the weird order of the arguments where the function we will mapply()
over comes before the inputs to said function.
## Check the arguments to mapply()
args(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
NULL
## Apply mapply() to our function my_sum() with the inputs 'x' and 'y'
mapply(sum776::my_sum, x, y)
[1] 2 4 6 8 10 12 14 16 18 20
## Or write an anynymous function that is:
## * not documented
## * not tested
## * not shared
##
## :(
mapply(function(a, b) {
+ b
a }, x, y)
[1] 2 4 6 8 10 12 14 16 18 20
purrr
alternative
The purrr
package, which is part of the tidyverse
, provides an alternative framework to the apply
family of functions in base R.
library("purrr") ## part of tidyverse
## Check the arguments of map2_int()
args(purrr::map2_int)
function (.x, .y, .f, ..., .progress = FALSE)
NULL
## Apply our function my_sum() to our inputs
::map2_int(x, y, sum776::my_sum) purrr
[1] 2 4 6 8 10 12 14 16 18 20
## You can also use anonymous functions
::map2_int(x, y, function(a, b) {
purrr+ b
a })
[1] 2 4 6 8 10 12 14 16 18 20
## purrr even has a super short formula-like syntax
## where .x is the first input and .y is the second one
::map2_int(x, y, ~ .x + .y) purrr
[1] 2 4 6 8 10 12 14 16 18 20
## This formula syntax has nothing to do with the objects 'x' and 'y'
::map2_int(1:2, 3:4, ~ .x + .y) purrr
[1] 4 6
Base R loops
lapply()
The lapply()
function does the following simple series of operations:
- it loops over a list, iterating over each element in that list
- it applies a function to each element of the list (a function that you specify)
- and returns a list (the
l
inlapply()
is for βlistβ).
This function takes three arguments: (1) a list X
; (2) a function (or the name of a function) FUN
; (3) other arguments via its ...
argument. If X
is not a list, it will be coerced to a list using as.list()
.
The body of the lapply()
function can be seen here.
lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<bytecode: 0x15c1335d0>
<environment: namespace:base>
The actual looping is done internally in C code for efficiency reasons.
It is important to remember that lapply()
always returns a list, regardless of the class of the input.
Hereβs an example of applying the mean()
function to all elements of a list. If the original list has names, the the names will be preserved in the output.
<- list(a = 1:5, b = rnorm(10))
x x
$a
[1] 1 2 3 4 5
$b
[1] 0.9722200 0.6674662 0.7846470 -0.2022569 -0.4246955 -0.1146525
[7] 1.0289155 1.7111564 1.2365643 0.2786648
lapply(x, mean)
$a
[1] 3
$b
[1] 0.5938029
Notice that here we are passing the mean()
function as an argument to the lapply()
function.
::map_dbl(x, mean) purrr
a b
3.0000000 0.5938029
What difference do you notice in terms of the output of lapply()
and purrr::map_dbl()
?
::map(x, mean) purrr
$a
[1] 3
$b
[1] 0.5938029
Functions in R can be used this way and can be passed back and forth as arguments just like any other object in R.
When you pass a function to another function, you do not need to include the open and closed parentheses ()
like you do when you are calling a function.
Here is another example of using lapply()
.
<- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
x lapply(x, mean)
$a
[1] 2.5
$b
[1] 0.002605756
$c
[1] 0.930439
$d
[1] 5.101692
You can use lapply()
to evaluate a function multiple times each with a different argument.
Next is an example where I call the runif()
function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.
<- 1:4
x lapply(x, runif)
[[1]]
[1] 0.1590025
[[2]]
[1] 0.7318223 0.6817140
[[3]]
[1] 0.7664481 0.2030166 0.2405505
[[4]]
[1] 0.4222646 0.4648125 0.6745461 0.9553146
When you pass a function to lapply()
, lapply()
takes elements of the list and passes them as the first argument of the function you are applying.
In the above example, the first argument of runif()
is n
, and so the elements of the sequence 1:4
all got passed to the n
argument of runif()
.
This also works with purrr
functions.
::map(x, runif) purrr
[[1]]
[1] 0.1129995
[[2]]
[1] 0.5138907 0.3849256
[[3]]
[1] 0.5426743 0.1539097 0.9295260
[[4]]
[1] 0.2136414 0.6829276 0.1135764 0.1645305
Functions that you pass to lapply()
may have other arguments. For example, the runif()
function has a min
and max
argument too.
In the example above I used the default values for min
and max
.
- How would you be able to specify different values for that in the context of
lapply()
?
Here is where the ...
argument to lapply()
comes into play. Any arguments that you place in the ...
argument will get passed down to the function being applied to the elements of the list.
Here, the min = 0
and max = 10
arguments are passed down to runif()
every time it gets called.
<- 1:4
x lapply(x, runif, min = 0, max = 10)
[[1]]
[1] 0.9344134
[[2]]
[1] 2.812775 5.972051
[[3]]
[1] 5.788086 7.568036 2.865283
[[4]]
[1] 9.554248 9.098009 7.881458 2.779307
So now, instead of the random numbers being between 0 and 1 (the default), the are all between 0 and 10. Again, this also works with purrr
functions.
::map(x, runif, min = 0, max = 10) purrr
[[1]]
[1] 5.169658
[[2]]
[1] 8.415324 8.916783
[[3]]
[1] 1.512034 7.947503 5.909642
[[4]]
[1] 6.654542 1.721982 2.589741 2.074400
The lapply()
function (and its friends) makes heavy use of anonymous functions. Anonymous functions are like members of Project Mayhemβthey have no names. These functions are generated βon the flyβ as you are using lapply()
. Once the call to lapply()
is finished, the function disappears and does not appear in the workspace.
Here I am creating a list that contains two matrices.
<- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
x x
$a
[,1] [,2]
[1,] 1 3
[2,] 2 4
$b
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
Suppose I wanted to extract the first column of each matrix in the list. I could write an anonymous function for extracting the first column of each matrix.
lapply(x, function(elt) {
1]
elt[, })
$a
[1] 1 2
$b
[1] 1 2 3
Notice that I put the function()
definition right in the call to lapply()
.
This is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside lapply()
, but if itβs going to be more complicated, itβs probably a better idea to define the function separately.
For example, I could have done the following.
<- function(elt) {
f 1]
elt[,
}lapply(x, f)
$a
[1] 1 2
$b
[1] 1 2 3
Now the function is no longer anonymous; its name is f
.
Whether you use an anonymous function or you define a function first depends on your context. If you think the function f
is something you are going to need a lot in other parts of your code, you might want to define it separately. But if you are just going to use it for this call to lapply()
, then it is probably simpler to use an anonymous function.
sapply()
The sapply()
function behaves similarly to lapply()
; the only real difference is in the return value. sapply()
will try to simplify the result of lapply()
if possible. Essentially, sapply()
calls lapply()
on its input and then applies the following algorithm:
If the result is a list where every element is length 1, then a vector is returned
If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
If it canβt figure things out, a list is returned
- This can be a source of many headaches and one of the main motivations behind the
purrr
package. Withpurrr
you know exactly what type of output you are getting!
- This can be a source of many headaches and one of the main motivations behind the
Hereβs the result of calling lapply()
.
<- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
x lapply(x, mean)
$a
[1] 2.5
$b
[1] 0.0286998
$c
[1] 1.373758
$d
[1] 4.986047
Notice that lapply()
returns a list (as usual), but that each element of the list has length 1.
Hereβs the result of calling sapply()
on the same list.
sapply(x, mean)
a b c d
2.5000000 0.0286998 1.3737582 4.9860467
Because the result of lapply()
was a list where each element had length 1, sapply()
collapsed the output into a numeric vector, which is often more useful than a list.
With purrr
, if I want a list output, I use map()
. If I want a double (numeric) output, we can use map_dbl()
.
::map(x, mean) purrr
$a
[1] 2.5
$b
[1] 0.0286998
$c
[1] 1.373758
$d
[1] 4.986047
::map_dbl(x, mean) purrr
a b c d
2.5000000 0.0286998 1.3737582 4.9860467
split()
The split()
function takes a vector or other objects and splits it into groups determined by a factor or list of factors.
The arguments to split()
are
str(split)
function (x, f, drop = FALSE, ...)
where
x
is a vector (or list) or data framef
is a factor (or coerced to one) or a list of factorsdrop
indicates whether empty factors levels should be dropped
The combination of split()
and a function like lapply()
or sapply()
is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying that function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as βmap-reduceβ in other contexts.
Here we simulate some data and split it according to a factor variable. Note that we use the gl()
function to βgenerate levelsβ in a factor variable.
<- c(rnorm(10), runif(10), rnorm(10, 1))
x <- gl(3, 10) # generate factor levels
f f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
split(x, f)
$`1`
[1] 1.47261027 -0.11778728 1.22907340 -0.36511061 0.62913588 1.34891088
[7] 0.15830704 -0.30661119 0.81886489 0.01370576
$`2`
[1] 0.23388354 0.67620425 0.01498285 0.76791031 0.02144240 0.08805958
[7] 0.54305526 0.06950228 0.83105730 0.25459071
$`3`
[1] -0.6992556 -0.4615856 3.5409743 1.1184335 0.5501715 0.6915062
[7] 1.4495427 2.0629720 2.1100236 0.3229176
A common idiom is split
followed by an lapply
.
lapply(split(x, f), mean)
$`1`
[1] 0.4881099
$`2`
[1] 0.3500688
$`3`
[1] 1.06857
Splitting a Data Frame
library("datasets")
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
We can split the airquality
data frame by the Month
variable so that we have separate sub-data frames for each month.
<- split(airquality, airquality$Month)
s str(s)
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
Then we can take the column means for Ozone
, Solar.R
, and Wind
for each sub-data frame.
lapply(s, function(x) {
colMeans(x[, c("Ozone", "Solar.R", "Wind")])
})
$`5`
Ozone Solar.R Wind
NA NA 11.62258
$`6`
Ozone Solar.R Wind
NA 190.16667 10.26667
$`7`
Ozone Solar.R Wind
NA 216.483871 8.941935
$`8`
Ozone Solar.R Wind
NA NA 8.793548
$`9`
Ozone Solar.R Wind
NA 167.4333 10.1800
Using sapply()
might be better here for a more readable output.
sapply(s, function(x) {
colMeans(x[, c("Ozone", "Solar.R", "Wind")])
})
5 6 7 8 9
Ozone NA NA NA NA NA
Solar.R NA 190.16667 216.483871 NA 167.4333
Wind 11.62258 10.26667 8.941935 8.793548 10.1800
Unfortunately, there are NA
s in the data so we cannot simply take the means of those variables. However, we can tell the colMeans
function to remove the NA
s before computing the mean.
sapply(s, function(x) {
colMeans(x[, c("Ozone", "Solar.R", "Wind")],
na.rm = TRUE
) })
5 6 7 8 9
Ozone 23.61538 29.44444 59.115385 59.961538 31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind 11.62258 10.26667 8.941935 8.793548 10.18000
We can also do this with purrr
as shown below.
::map(s, function(x) {
purrrcolMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})
$`5`
Ozone Solar.R Wind
23.61538 181.29630 11.62258
$`6`
Ozone Solar.R Wind
29.44444 190.16667 10.26667
$`7`
Ozone Solar.R Wind
59.115385 216.483871 8.941935
$`8`
Ozone Solar.R Wind
59.961538 171.857143 8.793548
$`9`
Ozone Solar.R Wind
31.44828 167.43333 10.18000
The above is not as condensed as the sapply()
output. We can use the superseded (aka no longer supported) function map_dfc()
:
::map_dfc(s, function(x) {
purrrcolMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})
# A tibble: 3 Γ 5
`5` `6` `7` `8` `9`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23.6 29.4 59.1 60.0 31.4
2 181. 190. 216. 172. 167.
3 11.6 10.3 8.94 8.79 10.2
Or use the currently supported function purrr::list_cbind()
. Though we also need to do a bit more work behind the scenes.
## Make sure we get data.frame / tibble outputs for each element
## of the list
:::map(s, function(x) {
purrr::as_tibble(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
tibble })
$`5`
# A tibble: 3 Γ 1
value
<dbl>
1 23.6
2 181.
3 11.6
$`6`
# A tibble: 3 Γ 1
value
<dbl>
1 29.4
2 190.
3 10.3
$`7`
# A tibble: 3 Γ 1
value
<dbl>
1 59.1
2 216.
3 8.94
$`8`
# A tibble: 3 Γ 1
value
<dbl>
1 60.0
2 172.
3 8.79
$`9`
# A tibble: 3 Γ 1
value
<dbl>
1 31.4
2 167.
3 10.2
## Now we can combine them with list_cbind()
:::map(s, function(x) {
purrr::as_tibble(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
tibble%>% purrr::list_cbind() })
# A tibble: 3 Γ 5
`5`$value `6`$value `7`$value `8`$value `9`$value
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23.6 29.4 59.1 60.0 31.4
2 181. 190. 216. 172. 167.
3 11.6 10.3 8.94 8.79 10.2
## And we can then add the actual variable it came from with mutate()
:::map(s, function(x) {
purrr::as_tibble(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
tibble%>%
}) ::list_cbind() %>%
purrr::mutate(Variable = c("Ozone", "Solar.R", "Wind")) dplyr
# A tibble: 3 Γ 6
`5`$value `6`$value `7`$value `8`$value `9`$value Variable
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 23.6 29.4 59.1 60.0 31.4 Ozone
2 181. 190. 216. 172. 167. Solar.R
3 11.6 10.3 8.94 8.79 10.2 Wind
As we know, the above output is not tidy.
So below we repeat the steps using similar functions. First we use the superseded map_dfr()
function. At https://purrr.tidyverse.org/reference/map_dfr.html we can find more details about why this function was superseded and what that means.
## Sadly map_dfr() is now superseded (aka not recommended)
:::map_dfr(s, function(x) {
purrrcolMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})
# A tibble: 5 Γ 3
Ozone Solar.R Wind
<dbl> <dbl> <dbl>
1 23.6 181. 11.6
2 29.4 190. 10.3
3 59.1 216. 8.94
4 60.0 172. 8.79
5 31.4 167. 10.2
## This is how we would have added back the Month variable
:::map_dfr(s, function(x) {
purrrcolMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
%>%
}) :::mutate(Month = as.integer(names(s))) dplyr
# A tibble: 5 Γ 4
Ozone Solar.R Wind Month
<dbl> <dbl> <dbl> <int>
1 23.6 181. 11.6 5
2 29.4 190. 10.3 6
3 59.1 216. 8.94 7
4 60.0 172. 8.79 8
5 31.4 167. 10.2 9
Instead, weβll use the t()
function for transposing a vector and purrr:list_rbind()
.
## Get data.frame / tibble outputs, but with each variable as a separate
## column. Here we used the t() or transpose() function.
:::map(s, function(x) {
purrr::as_tibble(t(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)))
tibble })
$`5`
# A tibble: 1 Γ 3
Ozone Solar.R Wind
<dbl> <dbl> <dbl>
1 23.6 181. 11.6
$`6`
# A tibble: 1 Γ 3
Ozone Solar.R Wind
<dbl> <dbl> <dbl>
1 29.4 190. 10.3
$`7`
# A tibble: 1 Γ 3
Ozone Solar.R Wind
<dbl> <dbl> <dbl>
1 59.1 216. 8.94
$`8`
# A tibble: 1 Γ 3
Ozone Solar.R Wind
<dbl> <dbl> <dbl>
1 60.0 172. 8.79
$`9`
# A tibble: 1 Γ 3
Ozone Solar.R Wind
<dbl> <dbl> <dbl>
1 31.4 167. 10.2
## Now we can row bind each of these data.frames / tibbles into a
## single one
:::map(s, function(x) {
purrr::as_tibble(t(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)))
tibble%>% purrr::list_rbind() })
# A tibble: 5 Γ 3
Ozone Solar.R Wind
<dbl> <dbl> <dbl>
1 23.6 181. 11.6
2 29.4 190. 10.3
3 59.1 216. 8.94
4 60.0 172. 8.79
5 31.4 167. 10.2
## Then with mutate, we can add the Month back
:::map(s, function(x) {
purrr::as_tibble(t(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)))
tibble%>%
}) ::list_rbind() %>%
purrr:::mutate(Month = as.integer(names(s))) dplyr
# A tibble: 5 Γ 4
Ozone Solar.R Wind Month
<dbl> <dbl> <dbl> <int>
1 23.6 181. 11.6 5
2 29.4 190. 10.3 6
3 59.1 216. 8.94 7
4 60.0 172. 8.79 8
5 31.4 167. 10.2 9
For the above task though, we might prefer to use dplyr
functions.
## group_by() is in a way splitting our input data.frame / tibble by
## our variable of interest. Then summarize() helps us specify how we
## want to use that data, before it's all put back together into a
## tidy tibble.
%>%
airquality ::group_by(Month) %>%
dplyr::summarize(
dplyrOzone = mean(Ozone, na.rm = TRUE),
Solar.R = mean(Solar.R, na.rm = TRUE),
Wind = mean(Wind, na.rm = TRUE)
)
# A tibble: 5 Γ 4
Month Ozone Solar.R Wind
<int> <dbl> <dbl> <dbl>
1 5 23.6 181. 11.6
2 6 29.4 190. 10.3
3 7 59.1 216. 8.94
4 8 60.0 172. 8.79
5 9 31.4 167. 10.2
tapply
tapply()
is used to apply a function over subsets of a vector. It can be thought of as a combination of split()
and sapply()
for vectors only. Iβve been told that the βtβ in tapply()
refers to βtableβ, but that is unconfirmed.
str(tapply)
function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
The arguments to tapply()
are as follows:
X
is a vectorINDEX
is a factor or a list of factors (or else they are coerced to factors)FUN
is a function to be applied- β¦ contains other arguments to be passed
FUN
simplify
, should we simplify the result?
Given a vector of numbers, one simple operation is to take group means.
## Simulate some data
<- c(rnorm(10), runif(10), rnorm(10, 1))
x ## Define some groups with a factor variable
<- gl(3, 10)
f f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
tapply(x, f, mean)
1 2 3
-0.2196580 0.5022941 0.7477475
We can also apply functions that return more than a single value. In this case, tapply()
will not simplify the result and will return a list. Hereβs an example of finding the range()
(min and max) of each sub-group.
tapply(x, f, range)
$`1`
[1] -1.483876 0.519538
$`2`
[1] 0.1970253 0.8083186
$`3`
[1] -0.5991801 2.3153189
With purrr
, we donβt have a tapply()
direct equivalent but we can still get similar results thanks to the split()
function.
split(x, f) %>% purrr::map_dbl(mean)
1 2 3
-0.2196580 0.5022941 0.7477475
split(x, f) %>% purrr::map(range)
$`1`
[1] -1.483876 0.519538
$`2`
[1] 0.1970253 0.8083186
$`3`
[1] -0.5991801 2.3153189
apply()
The apply()
function is used to a evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). However, it can be used with general arrays, for example, to take the average of an array of matrices. Using apply()
is not really faster than writing a loop, but it works in one line and is highly compact.
str(apply)
function (X, MARGIN, FUN, ..., simplify = TRUE)
The arguments to apply()
are
X
is an arrayMARGIN
is an integer vector indicating which margins should be βretainedβ.FUN
is a function to be applied...
is for other arguments to be passed toFUN
Here I create a 20 by 10 matrix of Normal random numbers. I then compute the mean of each column.
<- matrix(rnorm(200), 20, 10)
x head(x)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.1478102 0.78194041 1.39597861 0.3374186 -0.3772066 -0.1551193
[2,] 1.1928082 -0.55047148 -0.44067310 0.1790926 0.7971628 1.2972090
[3,] -0.3666672 -0.21764989 0.03522523 0.2310200 -0.3687196 0.9050708
[4,] 0.8768993 -0.38795209 0.17159879 1.0665414 -2.0481121 -0.6035171
[5,] -1.1538589 0.05690375 0.73959609 0.1201938 -0.6753521 -0.1134883
[6,] 0.1468126 1.89473187 0.46806980 -0.9837391 -1.3275220 -0.5301819
[,7] [,8] [,9] [,10]
[1,] -1.07378726 0.33293402 -0.9171032 1.0278666
[2,] 2.24127950 0.58601091 1.4886753 -0.4868490
[3,] -0.69385309 0.55338482 -0.6974887 1.4257935
[4,] -1.04739364 -0.42479752 -0.3919240 -2.2119075
[5,] 0.20018890 0.04881347 0.3149072 -0.9598799
[6,] -0.01155237 1.28176329 -0.1662271 1.1416646
apply(x, 2, mean) ## Take the mean of each column
[1] 0.21491707 0.15897344 0.10584976 -0.10490596 -0.61422150 -0.11846196
[7] -0.26240120 0.28376379 -0.24176678 0.05051771
I can also compute the sum of each row.
apply(x, 1, sum) ## Take the mean of each row
[1] 1.5007321 6.3042448 0.8061160 -5.0005644 -1.4219760 1.9138196
[7] -3.2819941 -1.0722210 0.7149896 1.4356676 -4.7687756 1.6602942
[13] 1.4402500 -3.6041757 0.8561434 -2.3253953 -2.1542951 -1.5485611
[19] -2.3146864 0.3056750
In both calls to apply()
, the return value was a vector of numbers.
Youβve probably noticed that the second argument is either a 1 or a 2, depending on whether we want row statistics or column statistics. What exactly is the second argument to apply()
?
The MARGIN
argument essentially indicates to apply()
which dimension of the array you want to preserve or retain.
So when taking the mean of each column, I specify
apply(x, 2, mean)
because I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run
apply(x, 1, sum)
because I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).
With purrr
this is a bit more complicated as it has been generalized. Matrices are arrays with 2 dimensions (margins), but they can have more. Here we use the array_branch()
function to change x
into a list, then use regular map_*()
functions to compute what we want.
array_branch(x, 2) %>% map_dbl(mean)
[1] 0.21491707 0.15897344 0.10584976 -0.10490596 -0.61422150 -0.11846196
[7] -0.26240120 0.28376379 -0.24176678 0.05051771
array_branch(x, 1) %>% map_dbl(sum)
[1] 1.5007321 6.3042448 0.8061160 -5.0005644 -1.4219760 1.9138196
[7] -3.2819941 -1.0722210 0.7149896 1.4356676 -4.7687756 1.6602942
[13] 1.4402500 -3.6041757 0.8561434 -2.3253953 -2.1542951 -1.5485611
[19] -2.3146864 0.3056750
Col/Row Sums and Means
For the special case of column/row sums and column/row means of matrices, we have some useful shortcuts.
rowSums
=apply(x, 1, sum)
rowMeans
=apply(x, 1, mean)
colSums
=apply(x, 2, sum)
colMeans
=apply(x, 2, mean)
The shortcut functions are heavily optimized and hence are much faster, but you probably wonβt notice unless youβre using a large matrix.
Another nice aspect of these functions is that they are a bit more descriptive. Itβs arguably more clear to write colMeans(x)
in your code than apply(x, 2, mean)
.
Other Ways to Apply
You can do more than take sums and means with the apply()
function.
For example, you can compute quantiles of the rows of a matrix using the quantile()
function.
<- matrix(rnorm(200), 20, 10)
x head(x)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.40643783 0.535177795 -0.7178755 -0.8254484 -0.09009748 0.27766089
[2,] 1.67337736 1.101405423 0.3616568 2.2453738 0.34282464 -0.36570372
[3,] 1.30327515 -0.950264855 0.1474001 -0.3453589 -1.18043640 -0.56348912
[4,] -0.01095906 -0.519561336 0.9359043 -2.9031949 0.85979431 -0.53688648
[5,] 2.14990998 0.007775716 -2.0175120 0.4995306 0.30601808 -0.09258318
[6,] 0.32169081 -0.631104172 1.1239667 0.5120566 0.20922259 -0.12703800
[,7] [,8] [,9] [,10]
[1,] 0.08226073 0.4265114 -0.5363039 -0.3771463
[2,] -1.04995918 -0.9506480 2.4528800 1.0970088
[3,] 0.30150996 0.8995539 -0.2999293 -1.5529530
[4,] 0.06780358 -1.7505990 -1.0639637 0.6571155
[5,] -0.26781160 -0.5434376 0.5430396 -0.1144159
[6,] -0.46197713 0.3387829 0.1298647 -0.5817599
## Get row quantiles
apply(x, 1, quantile, probs = c(0.25, 0.75))
[,1] [,2] [,3] [,4] [,5] [,6]
25% -0.4965145 -0.1885716 -0.8535709 -0.9321944 -0.2294627 -0.3782423
75% 0.3742436 1.5303844 0.2629825 0.5097875 0.4511525 0.3345099
[,7] [,8] [,9] [,10] [,11] [,12]
25% -0.1430691 -1.12572468 -0.4371619 -0.95760593 -0.3397996 -1.0746719
75% 0.4073344 -0.05669309 1.3434728 -0.02388901 0.5526121 0.5678845
[,13] [,14] [,15] [,16] [,17] [,18]
25% -0.3667007 -0.1613415 -0.2481536 -0.93456633 -1.2079463 -0.3128054
75% 0.1239238 0.6210515 0.6465062 0.03738274 -0.4282615 0.4599200
[,19] [,20]
25% -0.8690319 -0.5120423
75% 0.2777756 0.1436458
Notice that I had to pass the probs = c(0.25, 0.75)
argument to quantile()
via the ...
argument to apply()
.
array_branch(x, 1) %>%
map(quantile, probs = c(0.25, 0.75)) %>%
map(~ as.data.frame(t(.x))) %>%
list_rbind()
25% 75%
1 -0.4965145 0.37424360
2 -0.1885716 1.53038438
3 -0.8535709 0.26298250
4 -0.9321944 0.50978749
5 -0.2294627 0.45115247
6 -0.3782423 0.33450989
7 -0.1430691 0.40733437
8 -1.1257247 -0.05669309
9 -0.4371619 1.34347278
10 -0.9576059 -0.02388901
11 -0.3397996 0.55261206
12 -1.0746719 0.56788448
13 -0.3667007 0.12392384
14 -0.1613415 0.62105147
15 -0.2481536 0.64650623
16 -0.9345663 0.03738274
17 -1.2079463 -0.42826149
18 -0.3128054 0.45992000
19 -0.8690319 0.27777558
20 -0.5120423 0.14364577
Vectorizing a Function
Letβs talk about how we can βvectorizeβ a function.
What this means is that we can write function that typically only takes single arguments and create a new function that can take vector arguments.
This is often needed when you want to plot functions.
Hereβs an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is \(\sum_{i=1}^n(x_i-\mu)^2/\sigma^2\).
<- function(mu, sigma, x) {
sumsq sum(((x - mu) / sigma)^2)
}
This function takes a mean mu
, a standard deviation sigma
, and some data in a vector x
.
In many statistical applications, we want to minimize the sum of squares to find the optimal mu
and sigma
. Before we do that, we may want to evaluate or plot the function for many different values of mu
or sigma
.
<- rnorm(100) ## Generate some data
x sumsq(mu = 1, sigma = 1, x) ## This works (returns one value)
[1] 193.3047
However, passing a vector of mu
s or sigma
s wonβt work with this function because itβs not vectorized.
sumsq(1:10, 1:10, x) ## This is not what we want
[1] 124.5871
Thereβs even a function in R called Vectorize()
that automatically can create a vectorized version of your function.
So we could create a vsumsq()
function that is fully vectorized as follows.
<- Vectorize(sumsq, c("mu", "sigma"))
vsumsq vsumsq(1:10, 1:10, x)
[1] 193.30472 120.56844 107.91586 103.76324 101.96723 101.05972 100.55345
[8] 100.25137 100.06243 99.94026
## The details are a bit complicated though
## as we can see below
vsumsq
function (mu, sigma, x)
{
args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
names <- if (is.null(names(args)))
character(length(args))
else names(args)
dovec <- names %in% vectorize.args
do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]),
SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
}
<environment: 0x11c9347b8>
Pretty cool, right?
Parallelize your functions
We donβt have time to dive into the details, but with furrr
we can parallelize purrr
functions thanks to the future
package. More details at https://furrr.futureverse.org/.
Similarly, with BiocParallel::bplapply()
we can parallelize lapply()
commands. Details at https://bioconductor.org/packages/release/bioc/vignettes/BiocParallel/inst/doc/Introduction_To_BiocParallel.html and more generally at https://bioconductor.org/packages/BiocParallel/.
Summary
The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form
The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and the collating the results and returning the collated results.
Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere
The
split()
function can be used to divide an R object in to subsets determined by another variable which can subsequently be looped over using loop functions.With
purrr
you have very tight control on the expected output, unlike with base R.
Post-lecture materials
Final Questions
Here are some post-lecture questions to help you think about the material discussed.
- Write a function
compute_s_n()
that for any givenn
computes the sum
\[ S_n = 1^2 + 2^2 + 3^2 + \ldots + n^2 \]
Report the value of the sum when \(n\) = 10.
Define an empty numerical vector
s_n
of size 25 usings_n <- vector("numeric", 25)
and store in the results of \(S_1, S_2, \ldots, S_n\) using a for-loop.Repeat Q3, but this time use
sapply() or functions from
purrr.
Plot
s_n
versusn
. Use points defined by \(n= 1, \ldots, 25\)
Additional Resources
R session information
options(width = 120)
::session_info() sessioninfo
β Session info βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
setting value
version R version 4.3.1 (2023-06-16)
os macOS Ventura 13.6
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2023-09-28
pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)
β Packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
package * version date (UTC) lib source
brio 1.1.3 2021-11-30 [1] CRAN (R 4.3.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
colorout 1.3-0 2023-09-28 [1] Github (jalvesaq/colorout@8384882)
desc 1.4.2 2022-09-08 [1] CRAN (R 4.3.0)
digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
dplyr 1.1.3 2023-09-03 [1] CRAN (R 4.3.0)
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)
knitr 1.44 2023-09-11 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.3.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)
rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
sum776 0.99.0 2023-09-25 [1] Github (lcolladotor/sum776@b9cb6e3)
testthat * 3.1.10 2023-07-06 [1] CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
waldo 0.5.1 2023-05-08 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
[1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ