- Introduction
- 1. Input data
- 2. First approach: generate virtual species distributions by defining response functions
- 3. Second approach: generate virtual species with a Principal Components Analysis
- 4. Conversion of environmental suitability to presence-absence
- 5. Generating random virtual species
- 6. Exploring and using the outputs of virtualspecies
- 7. Sampling occurrence points
- 8. Introduction a distribution bias
- 9. Using virtual species with the SDM package biomod (not yet written)
- 10. Creating dispersal limitations with the MIGCLIM R package (not yet written)

*Many thanks to Céline Bellard for commenting and correcting this tutorial*

This complete tutorial introduces all the possibilities of the `virtualspecies`

package. It was written with the objective to be helpful for both beginners and experienced R users. You can read this tutorial in full or jump to the particular section you are looking for. In each section, you will find simple examples introducing each function, followed by detailed examples for almost every possible parametrisation in `virtualspecies`

.

After a small introduction on the spatial data used as input for the `virtualspecies`

package (section 1.), you will be introduced to the basics of generating virtual species distributions: the two possible approaches to create species-environment relationships (sections 2. and 3.), followed by the conversion of environmental suitability to presence-absence (section 4.). After that, you will see how to generate random virtual species (section 5.), and most importantly, how to explore, extract, and use the outputs of `virtualspecies`

(section 6.). Then, you will see how to sample occurrence points (section 7.). If you are interested, you may also learn about the introduction of distribution limitations in section 8.

In the following sections, you will see how to combine `virtualspecies`

with external packages : the species distribution models package biomod (section 9.), and the dispersal limitation package MIGCLIM (section 10. - not yet written).

I will make extensive use of climate data as an example here, but remember that you can use other types of data, as long as it is in the format described in section 1.

`virtualspecies`

uses spatialised environmental data as an input. These data must be gridded spatial data in the `RasterStack`

format of the package `raster`

.

The input data for virtual species is gridded spatial data (*i.e.*, raster data).

To summarise, a raster is a map regularly divided into small units, called pixels. Each pixel has its own value. These values can be, for example, temperature values, for a raster of temperature. Raster data can be imported into R using the package `raster`

. The command `stack()`

will allow you to import your own rasters (stored on your hard drive) into an object of class `RasterStack`

. Below you will find some concrete examples.

As stated above, the required format is a `RasterStack`

containing all the environmental variables with which you want to generate a virtual species distribution. Note that I may use interchangeably the variables and layers in this tutorial, because they roughly refer to the same aspect: a layer is the spatial representation of an environmental variable.

The most important part here is that every layer of your `RasterStack`

must be correctly named, because these names will be used when generating virtual species.

Hence, I strongly advise using explicit names for your layers. You can use “variable1”, “variable2”, etc. or “layer1”, “layer2”, etc. but names like “bio1”, “bio2”, “bio3”, (bioclimatic variable names) or “temp1”, “temp2”, etc. will reduce confusion.

You can access the names of the layers of your stack with `names(my.stack)`

, and modify them with `names(my.stack) <- c("name1", "name2", etc.)`

.

WorldClim (www.worldclim.org) freely provides gridded climate data (temperature and precipitations) for the entire World. These data can be downloaded to your hard drive from the website above, and then imported into R using the stack command: `stack("my.path/MyWorldClimStack.bil")`

.

Otherwise, a much simpler solution is to directly download them into R using the function `getData()`

(requires an internet connection).

You have to provide the type of environmental variables you are looking for:

- tmean = average monthly mean temperature (°C * 10)
- tmin = average monthly minimum temperature (°C * 10)
- tmax = average monthly maximum temperature (°C * 10)
- prec = average monthly precipitation (mm)
- bio = bioclimatic variables derived from the tmean, tmin, tmax and prec
- alt = altitude (elevation above sea level) (m) (from SRTM)

And the resolution: 0.5, 2.5, 5, or 10 minutes of a degree. Note that at fine resolutions the files downloaded are very heavy and may take a long time; at the finest resolution (0.5) you cannot download the global file, and you have to provide a longitude and latitude to obtain a tile of the world (see `?getData`

for help).

Here is the example in practice:

```
library(raster, quietly = T)
worldclim <- getData("worldclim", var = "bio", res = 10)
worldclim
```

```
## class : RasterStack
## dimensions : 900, 2160, 1944000, 19 (nrow, ncol, ncell, nlayers)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84
## names : bio1, bio2, bio3, bio4, bio5, bio6, bio7, bio8, bio9, bio10, bio11, bio12, bio13, bio14, bio15, ...
```

You can see that the object is of class `RasterStack`

, ready to use for `virtualspecies`

. The names of the layers are also convenient.

We can easily access a subset of the layers with `[[]]`

:

```
# The first four layers
worldclim[[1:4]]
```

```
## class : RasterStack
## dimensions : 900, 2160, 1944000, 4 (nrow, ncol, ncell, nlayers)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84
## names : bio1, bio2, bio3, bio4
```

```
# Layers bio1 and bio12 (annual mean temperature and annual precipitation)
worldclim[[c("bio1", "bio12")]]
```

```
## class : RasterStack
## dimensions : 900, 2160, 1944000, 2 (nrow, ncol, ncell, nlayers)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84
## names : bio1, bio12
```

And we can plot our variables:

```
# Plotting bio1 and bio12
par(oma = c(0.1, 0.1, 0.1, 2.1))
plot(worldclim[[c("bio1", "bio12")]])
```

The first approach to generate a virtual species consists in defining its response functions to each of the environmental variables contained in our `RasterStack`

. These responses are then combined to calculcate the environmental suitability of the virtual species.

The function providing this approach is `generateSpFromFun`

.

Before we start using the package, let’s prepare our first simulation of virtual species.

We want to generate a virtual species with two environmental variables, the annual mean temperature `bio1`

and annual mean precipitation `bio2`

. We want to generate a tropical species, i.e., living in hot and humid environments. We can define bell-shaped response functions to these two variables, as in the following figure:

```
## Loading required package: raster
## Loading required package: sp
```

*Note that bioclimatic temperature variables (here bio1) are is in °C * 10, as explained here.*

These bell-shaped functions are in fact gaussian distributions functions of the form:

\[ f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}\]

If we take the example of bio1 above, the mean \(\mu\) is equal to 250 (i.e., 25°C) and standard deviation \(\sigma\) is equal to 50 (5°C). Hence the response function for bio1 is:

\[ f(bio1) = \frac{1}{50\sqrt{2\pi}}\, e^{-\frac{(bio1 - 250)^2}{2 \times 50^2}} \]

In R, it is very simple to obtain the result of the equation above, with the function `dnorm`

:

```
# Suitability of the environment for bio1 = 15 °C
dnorm(x = 150, mean = 250, sd = 50)
```

`## [1] 0.001079819`

The idea with `virtualspecies`

is to keep the same simplicity when generating a species: we will use the `dnorm`

function to generate our virtual species.

Let’s start with the package now: The first step is to provide to the helper function `formatFunctions`

which responses you want for which variables:

```
library(virtualspecies)
my.parameters <- formatFunctions(bio1 = c(fun = 'dnorm', mean = 250, sd = 50),
bio12 = c(fun = 'dnorm', mean = 4000, sd = 2000))
```

After that, the generation of the virtual species is fairly easy:

```
# Generation of the virtual species
my.first.species <- generateSpFromFun(raster.stack = worldclim[[c("bio1", "bio12")]],
parameters = my.parameters,
plot = TRUE)
```

`my.first.species`

```
## Virtual species generated from 2 variables:
## bio1, bio12
##
## - Approach used: Responses to each variable
## - Response functions:
## .bio1 [min=-269; max=314] : dnorm (mean=250; sd=50)
## .bio12 [min=0; max=9916] : dnorm (mean=4000; sd=2000)
## - Each response function was rescaled between 0 and 1
## - Environmental suitability formula = bio1 * bio12
## - Environmental suitability was rescaled between 0 and 1
```

Congratulations! You have just generated your first virtual species distribution, with the default settings. In the next section you will learn about the simple, yet important settings of this function.

The function `generateSpFromFun`

proceeds into four important steps:

- The response to each environmental variable is calculated according to the user-chosen functions.
- Each response is rescaled between 0 and 1. This step can be disabled.
- The responses are combined together to compute the environmental suitability. The user can choose how to combine the responses.
- The environmental suitability is rescaled between 0 and 1. This step can be disabled.

You can use any existing function with `generateSpFromFun`

, as long as you provide the necessary parameters. For example, you can use the function `dnorm`

shown above, if you provide the parameters `mean`

and `sd`

: `formatFunctions(bio1 = c(fun = 'dnorm', mean = 250, sd = 50))`

. Naturally you can change the values of `mean`

and `sd`

to your needs.

You can also use the basic functions provided with the package:

Linear function:

`formatFunctions(bio1 = c(fun = 'linearFun', a = 1, b = 0))`

\[ f(bio1) = a * bio1 + b \]Quadratic function:

`formatFunctions(bio1 = c(fun = 'quadraticFun', a = -1, b = 2, c = 0))`

\[ f(bio1) = a \times bio1^2 + b \times bio1 + c\]- Logistic function:
`formatFunctions(bio1 = c(fun = 'logisticFun', beta = 150, alpha = -5))`

\[ f(bio1) = \frac{1}{1 + e^{\frac{bio1 - \beta}{\alpha}}} \] Normal function defined by extremes:

`formatFunctions(bio1 = c(fun = 'custnorm', mean = 250, diff = 50, prob = 0.99))`

This function allows you to set extremum of a normal curve. In the example above, we define a response where the optimum is 25°C (`mean = 250`

), and 99% of the area under the curve (`prob = 0.99`

) will be comprised between 20 and 30°C (`diff = 50`

).Beta response function (Oksanen & Minchin, 2002,

*Ecological Modelling***157**:119-129):`formatFunctions(bio1 = c(fun = 'betaFun', p1 = 0, p2 = 35, alpha = 0.9, gamma = 0.08))`

\[ f(bio1) = (bio1 - p1)^{\alpha} * (p2 - bio1)^{\gamma} \]Or you can create your own functions (see the section How to create your own response functions if you need help for this).

This rescaling is performed because if you use very different response function for different variables, (*e.g.*, a Gaussian distribution function with a linear function), then the responses may be disproportionate among variables. By default this rescaling is enabled (`rescale.each.response = TRUE`

), but it can be disabled (`rescale.each.response = FALSE`

).

There are three main possibilities to combine response functions to calculate the environmental suitability, as defined by the parameters `species.type`

and `formula`

:

`species.type = "additive"`

: the response functions are added.`species.type = "multiplicative"`

: the response functions are multiplied. This is the default behaviour of the function.`formula = "bio1 + 2 * bio2 + bio3"`

: if you choose a formula, then the response functions are combined according to your formula (parameter`species.type`

is then ignored).

For example, if you want to generate a species with the same partial responses as in 2.1, but with a strong importance for temperature, then you can specify the formula : `formula = "2 * bio1 + bio12"`

```
library(virtualspecies)
my.parameters <- formatFunctions(bio1 = c(fun = 'dnorm', mean = 250, sd = 50),
bio12 = c(fun = 'dnorm', mean = 4000, sd = 2000))
# Generation of the virtual species
new.species <- generateSpFromFun(raster.stack = worldclim[[c("bio1", "bio12")]],
parameters = my.parameters,
formula = "2 * bio1 + bio12",
plot = TRUE)
```

`## [1] "2 * bio1 + bio12"`

`new.species`

```
## Virtual species generated from 2 variables:
## bio1, bio12
##
## - Approach used: Responses to each variable
## - Response functions:
## .bio1 [min=-269; max=314] : dnorm (mean=250; sd=50)
## .bio12 [min=0; max=9916] : dnorm (mean=4000; sd=2000)
## - Each response function was rescaled between 0 and 1
## - Environmental suitability formula = 2 * bio1 + bio12
## - Environmental suitability was rescaled between 0 and 1
```

One can even make complex interactions between partial responses:

```
library(virtualspecies)
my.parameters <- formatFunctions(bio1 = c(fun = 'dnorm', mean = 250, sd = 50),
bio12 = c(fun = 'dnorm', mean = 4000, sd = 2000))
# Generation of the virtual species
new.species <- generateSpFromFun(raster.stack = worldclim[[c("bio1", "bio12")]],
parameters = my.parameters,
formula = "3.1 * bio1^2 - 1.4 * sqrt(bio12) * bio1",
plot = TRUE)
```

`## [1] "3.1 * bio1^2 - 1.4 * sqrt(bio12) * bio1"`