This Tutorial explains the concept behind, and the implementation of
Monte Carlo Simulations (MCS) in R. To implement MCS, we will make use
of one of the following two tools: for
-loops or the
map
-functions from the package purrr. Therefore, the tutorial
also contains a general methodological section on these two approaches
that you might read independently of the particular applications to MCS
below. If you want to read a bit more about the topic of the tutorial,
you may have a look at chapter 21 of R for Data Science.
This tutorial assumes you are using a typical directory structure as described in the respective tutorial. During this tutorial we will use the following packages:
library(here)
library(tidyr)
library(data.table)
library(dplyr)
library(ggplot2)
library(scales)
library(purrr)
library(icaeDesign)
Whenever you are doing a certain task more than twice you should immediately become suspicious and ask yourself whether you can automate the task. We already discussed the reasons behind this rationale and introduced one important way to achieve automation: the use of functions.
Here we talk about a general approach that complements the use of functions when it comes to automation: the idea of iteration.
There are two alternative ways of how to implement iterations in R:
using for
-loops, or a map
-function from the
package purrr. The latter
option is clearer and more concise, the former is more flexible and
intuitive. So, its generally good to know about both ways and then
choose the approach that you find more appealing for the case at
hand.
As you will see below, the two iteration concepts and functions and complements, not substitutes: often, especially in the context of MCS, you will write functions that you then combine with the iteration tools described below.
The idea of a for-loop is to ‘loop over’ an object, an do something for (or on) every element in this object.
In general, every for-loop (should) consist of three parts:
Below we is an example loop that loops through a list with three vectors as elements elements, and computes the mean for each element of the list. Before we inspect the loop as such, this is the list we are looping over:
<- list(
base_list "element_1" = c(1, 4, 5, 6),
"element_2" = c(9, 2, 2, 8),
"element_3" = c(4, 3, 0, 7)
)
And this is how the loop looks like:
<- rep(NA, length(base_list)) # The output container
result for (i in seq_along(base_list)) { # The looping sequence
<- mean(base_list[[i]]) # The action body
result[[i]] }
The first part is actually located outside the for
loop in a strict sense. But for the sake of efficiency, it is very
important that you prepare a vector into which you can save the results
of your loop. The vector should be of the same length as your loop will
have iterations, and it is a good idea to use a vector full of
NA
s that are then replaced during the actual loop. Thus,
the length of the output contains should equal the length of the object
over which you are looping.
The second part is the looping sequence, and it
determines the number of times your loop will iterate the task that is
specified in the action body below. In the case above, we want to
compute the mean for each element of base_list
. In other
words: we are looping over each element of base_list
and
the loop will comprise of three iterations.
The looping sequence always starts with the keyword for
and an opening bracket. Then you specify a keyword that can be an
(almost) arbitrary word. But usually, one uses a single letter, most
commonly the letter ‘i’. This letter will take a different value during
each iteration of the loop. The value it takes is specified by the
vector that comes after the next keyword in
: in the case
above, this sequence is created by
seq_along(base_list)
:
seq_along(base_list)
## [1] 1 2 3
As you can see it is an index vector, i.e. a vector that starts with
1, only contains subsequent integers, and has so many elements as the
object that has been passed as an argument to
seq_along()
.
In the case above, this means that in the first iteration of the
loop, i
takes the value 1
. In the second
iteration it takes the value 2
. And in the third and last
iteration it takes the value 3
. You could make this visible
by writing a loop like this:
for (i in seq_along(base_list)){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
In principle, the keyword of your loop can take arbitrary values. For
instance you could also loop over the names of
base_list
:
for (i in names(base_list)){
print(i)
}
## [1] "element_1"
## [1] "element_2"
## [1] "element_3"
But this is usually not a very good idea. It is usually best to loop over an index vector of the main object of interest.
The third part of the loop is the action body. This is where the interesting code resides in, and where you determine what should be done during each iteration of the loop. Here, the code usually produces one result per iteration that should be allocated to the output contained created in the first step.
In the present case, we wanted to compute the mean for each element
of the list base_list
. Thus, our action body only consists
of a call of the function mean
, and then saves the result
in the initial output container. If we then inspect the output
container, we see that all the NA
s were replaced by the
means of the respective elements of base_list
:
<- rep(NA, length(base_list)) # The output container
result print(result) # Only contains NAs
## [1] NA NA NA
for (i in seq_along(base_list)) { # The looping sequence
<- mean(base_list[[i]]) # The action body
result[[i]]
}print(result) # NAs were replaced by the means
## [1] 4.00 5.25 3.50
While the loop above was very simple, the basic concept is extremely powerful: the action body of a loop can be of arbitrary complexity, making for loops a very powerful programming tool. But caution: very complex computations take time, and iterating them many times using a for loop can take a lot of time.
Here are some general tips for developing for-loops:
print()
statements into the action body; this
is especially useful for your looping keyword i
since this
makes explicit what you are actually looping overseq_along()
or
seq_len()
.Before proceeding with the tutorial, it might be a good idea to practice. Here are some suggestions for exercises (you find possible solutions here):
c(1,2,3,4,5)
and computes the square root for each
element.0
and standard deviation 1
. You can
get a single draw by calling rnorm(1)
:rnorm(1)
## [1] -0.6990273
<- tibble::tibble(
ex_tib "a" = rnorm(5, mean = 5, sd = 2),
"b" = rpois(5, lambda = 3),
"c" = rcauchy(5, location = 3, scale = 2)
) ex_tib
## # A tibble: 5 × 3
## a b c
## <dbl> <int> <dbl>
## 1 5.40 5 6.93
## 2 7.56 3 -7.81
## 3 6.91 2 -2.24
## 4 5.38 1 4.69
## 5 2.60 3 0.781
The package purrr
provides a number of functions that
are designed to facilitate the development of iteration tasks. To this
end, they allow you to, first, think about how to accomplish what you
want to do for a single element of the object of interest and then,
second, simply pass your solution to a function from the
map
-family and let it take care of the iteration.
The map
-family is made up of five functions, which only
differ by the type of output they produce: map()
, which
returns output in the form of a list, as well as map_lgl()
,
map_int()
, map_dbl()
, and
map_chr()
, which return output in the form of an atomic
vector of type logical
, integer
,
double
, and character
, respectively.
All five functions, however, work very similarly and take the same
arguments, which means: once you understood one of them you understood
all of them. All of them take two mandatory arguments: .x
and .f
. The first argument, .x
, specifies a
vector, and the second, .f
, a function that gets applied to
each element of the vector.
Lets consider one of the examples above: take our list
base_list
and compute the mean for each of its elements.
Since base_list
is the object over which we want to iterate
it is passed to the function via the argument .x
, and since
we want to compute the mean, we pass the function mean()
via the argument .f
. In the following case we use
purrr::map()
, so the result will be a
list
:
::map(.x = base_list, .f = mean) purrr
## $element_1
## [1] 4
##
## $element_2
## [1] 5.25
##
## $element_3
## [1] 3.5
If we would like the result to be, for instance, of type
double
we simply use purrr::map_dbl()
instead:
::map_dbl(.x = base_list, .f = mean) purrr
## element_1 element_2 element_3
## 4.00 5.25 3.50
Sometimes you want to pass a function with optional arguments to
.f
. The following application, for instance, does not
compute the mean if there are NA
s in the vectors:
<- list(c(1, 4, NA, 2), c(NA, NA, 3, 0), c(9, 8, 3, 4))
input_vec ::map_dbl(.x = input_vec, .f = mean) purrr
## [1] NA NA 6
To avoid this we need to set the optional argument na.rm
of mean()
to TRUE
:
::map_dbl(.x = input_vec, .f = ~mean(.x, na.rm=TRUE)) purrr
## [1] 2.333333 1.500000 6.000000
As you can see we added a ~
in front of the function
name and then referred to the first argument of the function and over
which we want to iterate as .x
. Then we can add additional
arguments as desired.
Its now a good idea to practice the use of the map-functions. Note that the exercises are very similar to the ones from the section on for-loops above. This is because for-loops and the map-functions are geared to do the same things just in different ways (again, the solutions can be found here).
Compute the square root for each element of the vector
c(1,2,3,4,5)
by using a map-function from
purrr
. In the first case, your solution should take the
form of a list, in the second case an atomic vector of type
character
.
Draw 10 numbers from a normal distribution with mean
2
and standard deviation 4
. Again, use a
map-function from
purrr´ and return the result one time as an atomic vector of type
doubleand one time of type
character`.
Computes the median of the elements in each column of the tibble
ex_tib
as defined above. The result should come in form of
a list
.
The basic idea of a MCS is to repeat a particular random process many times. It does not come to a surprise that this is not difficult to do with the tools we just acquired.
Conducting a MCS always involves the following steps:
Technically, there are two alternative ways of how to implement a MCS
in R: using for
-loops, or a map
-function from
the package purrr, as
described above. The latter option is clearer and more concise, the
former is more flexible and intuitive. Below we use both alternatives,
so its your choice of how you are conducting your own experiments.
We start with a simple example that does not relate to the topic of sampling. Then we consider a more complex example and replicate the example from the lecture slides where we drew a sample from the ball pit.
The first example we consider is a MCS of an unfair coin. Assume we want to throw the coin many times to figure our the probability for the coin to show ‘head’.
In the first step we need to define the random process. In the present case this is the process of throwing a coin that shows ‘head’ with a certain probability.1 To this end, we define a function that does exactly this:
#' Throwing a coin
<- function(prob_head){
throw_coin sample(c("Head", "Tail"), size = 1, prob = c(prob_head, 1-prob_head))
}
In the second case, we repeat the random process. Lets do this 1000 times, assuming that the probability for our unfair coin to show ‘head’ is set to 80%. Using a for loop this would look like this:
<- 1000
iterations <- 0.8
probability_head <- rep(NA, iterations)
output_container for (i in seq_len(iterations)){
<- throw_coin(prob_head = probability_head)
output_container[i] }
If we used a map-function we would write the following:
<- map_chr(.x = 1:iterations, ~throw_coin(prob_head = 0.8)) map_output
Both approaches accomplish the same thing. We can now
summarize or visualize the results. We
could do this by using quantitative indicators, but the most common and
usually most illustrative way is to use a histogram. We can do this by
using ggplot2::ggplot()
, we only need to transform our
output container into a tibble
:
<- tibble::tibble(map_output)
output_tib
ggplot(data = output_tib, aes(x=map_output)) +
scale_y_continuous(expand = expansion(add = c(0, 40))) +
labs(y = "Times observed") +
geom_histogram(stat = "count") +
theme_bw() + theme(axis.title.x = element_blank())
In this case we want to simulate the act of drawing a sample from an artificial population of a ball pid.
This time, the first step involves the notion of a population since it represents the drawing of sample from this population. Thus, in a very first step we need to create such a population. In our case, this is a ball pid with 5000 balls, 65 per cent of which are white and 35 per cent of which are grey:
<- 5000
ball_pid_size <- 0.65
ball_pid_share_white <- as.integer(ball_pid_share_white*ball_pid_size)
white_balls <- ball_pid_size - white_balls
grey_balls <- c(rep("white", white_balls), rep("grey", grey_balls))
ball_pid_colors
<- tibble::tibble(
ball_pid id = seq(1, ball_pid_size),
color = sample(ball_pid_colors)
)head(ball_pid, 4)
## # A tibble: 4 × 2
## id color
## <int> <chr>
## 1 1 white
## 2 2 white
## 3 3 grey
## 4 4 white
Now we can write a function that formalizes the act of drawing a sample from the ball pid and computing the share of white balls:
#' Function that takes a sample of balls and computes share of white balls
<- function(bowl, sample_size, replace_balls=FALSE){
sample_balls <- sample(x = bowl, size = sample_size, replace = replace_balls)
sample_obtained sum(sample_obtained == "white") / sample_size
}
We now conduct the MCS by iterating the process of drawing such a sample 1000 times. Suppose we want to use our MCS to study the effect of taking samples of different sizes, say 20, 50, or 100. Using a for-loop this could look like this:
<- 1000
n_samples <- rep(NA, n_samples)
results_n20 <- rep(NA, n_samples)
results_n50 <- rep(NA, n_samples)
results_n100
for (i in seq_len(n_samples)){
<- sample_balls(bowl = ball_pid$color, sample_size = 20)
results_n20[i] <- sample_balls(bowl = ball_pid$color, sample_size = 50)
results_n50[i] <- sample_balls(bowl = ball_pid$color, sample_size = 100)
results_n100[i]
}
<- tibble(
result_table sample_size20 = results_n20,
sample_size50 = results_n50,
sample_size100 = results_n100
)
A variant with a map-function would look like this:
<- purrr::map_dbl(
results_n20_map .x = 1:n_samples,
.f = ~sample_balls(bowl = ball_pid$color, sample_size = 20))
<- purrr::map_dbl(
results_n50_map .x = 1:n_samples,
.f = ~sample_balls(bowl = ball_pid$color, sample_size = 50))
<- purrr::map_dbl(
results_n100_map .x = 1:n_samples,
.f = ~sample_balls(bowl = ball_pid$color, sample_size = 100))
Should we create a new population for during each iteration? It depends on the question asked. If you are interested in one particular population, you should keep this population fixed over all iteration. However, if you are interested in more general results for a population that itself involves random noise, it might be appropriate the re-generate the population for each iteration of your simulation. In the cases considered here, however, we can treat the population as fixed.
We can now visualize our results using a histogram:
<- result_table %>%
hist_visualization pivot_longer(cols = everything(),
names_to = "Sample size",
values_to = "Share of white balls") %>%
mutate(
`Sample size` = as.integer(
::str_remove_all(`Sample size`, "[^0-9.-]"))
stringr%>%
) ggplot(data = ., aes(x=`Share of white balls`)) +
geom_histogram(binwidth = 0.02, fill="#00395B") +
scale_y_continuous(expand = expansion(add = c(0, 1))) +
scale_x_continuous(labels = percent_format()) +
labs(
x = "Share of white balls",
y = "Number of samples",
title = "True share: 65%") +
geom_vline(xintercept = 0.65) +
facet_wrap(~`Sample size`, scales = "fixed") +
theme_icae() +
theme(
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank())
hist_visualization
Of course, we could also illustrate the result quantitatively:
%>%
result_table pivot_longer(cols = everything(),
names_to = "Sample size",
values_to = "Share of white balls") %>%
mutate(
`Sample size` = as.integer(
::str_remove_all(`Sample size`, "[^0-9.-]"))
stringr%>%
) group_by(`Sample size`) %>%
summarise(
`Mean share` = mean(`Share of white balls`),
`Standard deviation` = sd(`Share of white balls`)
)
## # A tibble: 3 × 3
## `Sample size` `Mean share` `Standard deviation`
## <int> <dbl> <dbl>
## 1 20 0.654 0.107
## 2 50 0.653 0.0695
## 3 100 0.650 0.0464
We end this tutorial by providing an overview over a number of functions that are usually extremely useful in the context of MCS, or analyzing random processes more generally:
We already encountered the function sample()
, which we
can use to draw random samples: as a first argument x
it
accepts a vector, from which you can then draw random samples. The
second argument, size
, controls the size of the sample. You
can either draw random samples with or without replacement by specifying
the optional argument replace
, and you can set
probabilities to draw a particular element of x
by giving a
vector of probabilities via prob
.
When you want to simulate random draws from a particular
distribution, R
comes with a very handy syntax: different
probability distributions all have their own abbrevation and depending
what you want to do with this distribution you just put a letter in from
of this abbrevation. If, for instance, you want to draw a random element
from such a distribution you just add the letter r
.
For instance, if you want to simulate a draw from a normal
distribution you add r
to the abbrevation of the normal
distribution, which is norm
:
rnorm(n = 5)
## [1] 0.43507858 0.93783087 1.51554430 0.01191879 0.90801087
If you want to draw from an exponential distribution you add
r
to the abbrevation for the exponential distribution,
which is exp
:
rexp(n = 4)
## [1] 0.08803341 1.68605246 1.19286222 0.64202638
In fact, here is an overview over the various prefixes you can use:
p
for “probability”, the cumulative distribution
function (CDF)q
for “quantile”, the inverse CDFd
for “density”, the density function (PDF)To get the CDF for a particular quantile of the normal distribution you can do:
pnorm(q = 0.25)
## [1] 0.5987063
Here you find an overview over the abbrevations for some of the most common distribution, but its easy to find additional information in the internet as well:
Distribution | Kind | Abbrevation | Parameters |
---|---|---|---|
Binomial | Discrete | binom |
size , prob |
Poisson | Discrete | pois |
lambda |
Uniform | Continuous | punif |
min , max |
Normal | Continuous | norm |
mean , sd |
Exponential | Continuous | exp |
rate |
Absolutely essential is the function set.seed()
. It
allows us set a particular value for the random number generator of
R
. You can read more about how R
actually
creates random numbers using a pseudo random number generator called the
Mersenne
Twister. But the general take away is the following: random numbers
in R are only ‘pseudo’ in the sense that whenever the Mersenne Twister
starts from the same location it produces the same random numbers. And
set.seed()
lets you pick this starting point. This is
important because it makes MCS reproducible.
To illustrate this, consider the function rnorm()
, which
creates draws from a normal distribution. If we just call the function
the result is always different due to the Mersenne Twister
algorithm:
rnorm(5)
## [1] 0.40586170 0.69036963 -0.19302003 1.29006415 -0.01150031
rnorm(5)
## [1] 1.12742057 0.48053323 0.44261527 1.02904187 0.05250098
But if we set the seed for the Mersenne Twister explicitly, the outcome becomes reproducible:
set.seed(123)
rnorm(5)
## [1] -0.56047565 -0.23017749 1.55870831 0.07050839 0.12928774
set.seed(123)
rnorm(5)
## [1] -0.56047565 -0.23017749 1.55870831 0.07050839 0.12928774
This is extremely useful if we want to keep our results reproducible.
In reality such an experiment only makes sense if we do not now this probability, but here we are dealing with this situation only for the purpose of illustration.↩︎