Massive Eco-evolutionary Synthesis Simulations (MESS)

MESS is a novel comparative phylogeographic model grounded in community ecological theory. This integrative approach makes use of four data axes (distributions of traits, abundances, genetic diversities/divergences, and phylogenetic patterns) to enable testing alternative community assembly models (neutral vs non-neutral) and estimating parameters underlying different assembly processes (e.g. dispersal vs in situ speciation). This method capitalizes on the widespread use of DNA barcoding and meta-barcoding approaches.

What kind of data it requires

MESS requires sequence data from population-level sampling (5-10 individuals per species) from one or multiple ecological communities of organisms. This can be at a variety of scales, ranging from a microbial community within a host individual, to a locally sampled plot targeting all members of a taxonomic group, to a regional assemblage that emerged via dispersal and/or local speciation. Currently only single-locus data is supported, so community metabarcoding projects are quite appropriate. Other data types can be included but are not required (abundances, per-taxon trait measurements, and phylogenetic information).

Software

MESS is implemented in Python and the code is available on GitHub.

Try it now!

Experiment with the MESS model now! Launch a Binder instance and you can open and run the notebooks in the jupyter-notebooks directory.

References

The MESS manuscript is available on bioRxiv:

Overcast I, Ruffley M, Rosindell J, Harmon L, Borges PA, Emerson BC, ... &
    Rominger A. (2020). A unified model of species abundance, genetic
    diversity, and functional diversity reveals the mechanisms structuring
    ecological communities. bioRxiv.

MESS is based on previous work on the gimmeSAD joint neutral model and the CAMI trait-mediated community assembly model which can be found here:

Overcast I, Emerson BC, Hickerson MJ. (2019). An integrated model of
    population genetics and community ecology. Journal of Biogeography,
    46: 816-829. https://doi.org/10.1111/jbi.13541

Ruffley M, Peterson K, Week B, Tank DC, & Harmon LJ. (2019). Identifying
    models of trait‐mediated community assembly using random forests and
    approximate Bayesian computation. Ecology and Evolution, 9(23), 13218-
    13230.

Installation

MESS requires Python >= 3.8. It can be installed with conda (recommended) or with pip.

Method 1: Install pre-built binary from conda

1. Download and install anaconda or miniconda.

2. (Optional) create a separate conda environment to install into:

conda create -n mess-env python=3.8
conda activate mess-env
3. Install:
conda install -c mess -c conda-forge mess

See venv to install into a virtual environment.

Massive Eco-Evolutionary Synthesis Simulations (MESS) - CLI Tutorial

This is the first part of the full tutorial for the command line interface (CLI) for MESS. In this tutorial we’ll walk through the basics of generating simulations. This is meant as a broad introduction to familiarize users with the general workflow, and some of the parameters and terminology. We will use as a template example the spider community dataset from La Reunion published by Emerson et al (2017). However, you can follow along with one of the other example datasets if you like; the procedure will be identical although your results will vary.

  • Quickstart guide for the impatient
  • Equilibrium theory of island biogeography and ecological neutral theory
  • Overview of MESS simulation and analysis workflow
  • Installation
  • Getting started with the MESS CLI
  • Create and edit a new parameters file
  • Run simulations using your edited params file
  • Inspect the output of the simulation runs
  • Setting prior ranges on parameters

Each grey cell in this tutorial indicates a command line interaction. Lines starting with $ indicate a command that should be executed in a terminal connected to the cluster, for example by copying and pasting the text into your terminal. Elements in code cells surrounded by angle brackets (e.g. <mystuffs>) are variables that need to be replaced by the user. All lines in code cells beginning with ## are comments and should not be copied and executed. All other lines should be interpreted as output from the issued commands.

## Example Code Cell.
## Create an empty file in my home directory called `watdo.txt`
$ touch ~/watdo.txt

## Print "wat" to the screen
$ echo "wat"
wat

Quickstart guide for the impatient

TL;DR: Just show me how to do the simulations! Say you’re impatient and want to skip right to the good stuff, well here you go.

## Create a parameters file
$ MESS -n simdata
## Do 10 simulations using the default settings and 4 cores
$ MESS -p params-simdata.txt -s 10 -c 4

-------------------------------------------------------------
 MESS [v.0.1.1]
 Massive Eco-Evolutionary Synthesis Simulations
-------------------------------------------------------------
 Project directory exists. Additional simulations will be appended.

   <MESS.Region simdata: ['island1']>
 establishing parallel connection:
 host compute node: [4 cores] on goatzilla
   Generating 10 simulation(s).
 [####################] 100%  Performing Simulations    | 0:00:28 |
 [####################] 100%
   Finished 10 simulations

BAM!

Equilibrium theory of island biogeography and ecological neutral theory

Before we begin, if you are unfamiliar with community ecology theory it might be useful to review a brief google presentation given at a recent workshop that provides background and a brief introduction to the MESS model:

_images/Forward_Time_Neutral_Assembly.png

A classic individual based birth/death/colonization model of community assembly. For each timestep one individual is randomly sampled to ‘die’ and is replaced with some small probability by a colonizer from the regional pool, and otherwise is replaced by the offspring of a random individual sampled from the local community.
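The birth/death/colonization loop described above can be sketched in a few lines of Python. Everything here (the names, the community sizes, the 0.01 colonization probability) is illustrative and not MESS's actual implementation:

```python
# Minimal sketch of the individual-based neutral assembly loop described
# above. Names and parameter values are illustrative, not MESS internals.
import random

random.seed(42)

metacommunity = [f"sp{i}" for i in range(100)]              # regional species pool
local = [random.choice(metacommunity) for _ in range(500)]  # local community (J = 500)
colonization_rate = 0.01                                    # probability of replacement by a colonist

def timestep(local):
    # One randomly sampled individual 'dies'...
    victim = random.randrange(len(local))
    if random.random() < colonization_rate:
        # ...and is replaced by a colonizer from the regional pool,
        local[victim] = random.choice(metacommunity)
    else:
        # ...or by the offspring of a random local individual.
        local[victim] = random.choice(local)

for _ in range(10000):
    timestep(local)

print(len(set(local)), "species persist locally")
```

Note that the local community size J stays fixed: every death is exactly balanced by a birth or a colonization, which is the defining constraint of this class of neutral models.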

Overview of MESS simulation and analysis workflow

The basic steps of this process are as follows. In this intro tutorial we will only cover steps 1 and 2, but after you are comfortable with generating simulations you may proceed to the ML inference tutorial <inference.md> to learn about the remaining steps:

  • Step 1 - Set parameters based on prior knowledge of empirical system
  • Step 2 - Run mega simulations
  • Step 3 - Use ML to infer community assembly process
  • Step 4 - Use ML to estimate key community assembly parameters
  • Step 5 - ???
  • Step 6 - Profit!!

Installation

MESS is distributed as a conda package so installation is simple and straightforward. If you don’t already have conda and/or MESS installed, please take a moment to install the software.

Getting started with the MESS CLI

From here on out we assume you are sitting at a terminal in a conda environment with MESS installed, ok?

To better understand how to use MESS, let’s take a look at the -h argument to seek ‘help’. We will use some of the MESS command line arguments in this tutorial (for example: -n, -p, -s, -c). The complete list of optional arguments and their explanation can be accessed with the -h flag:

$ MESS -h
usage: MESS [-h] [-n new] [-p params] [-s sims] [-c cores] [-r] [-e empirical]
            [-f] [-q] [-Q] [-d] [-l] [--ipcluster [ipcluster]] [--fancy-plots]

optional arguments:
  -h, --help            show this help message and exit
  -n new                create new file 'params-{new}.txt' in current
                        directory
  -p params             path to params file simulations: params-{name}.txt
  -s sims               Generate specified number of simulations
  -c cores              number of CPU cores to use (Default=0=All)
  -r                    show status of this simulation run
  -e empirical          Validate and import empirical data.
  -f                    force overwrite of existing data
  -q                    do not print to stderror or stdout.
  -Q                    do not print anything ever.
  -d                    print lots more info to mess_log.txt.
  -l                    Write out lots of information in one directory per
                        simulation.
  --ipcluster [ipcluster]
                        connect to ipcluster profile
  --fancy-plots         Construct fancy plots and animated gifs.

  * Example command-line usage:
    MESS -n data                       ## create new file called params-data.txt
    MESS -p params-data.txt            ## run MESS with settings in params file
    MESS -p params-data.txt -f         ## run MESS, overwrite existing data.

Create and edit a new parameters file

MESS uses a text file to hold all the parameters for a given community assembly scenario. Start by creating a new parameters file with the -n flag. This flag requires you to pass in a name for your simulations. In the example we use simdata but the name can be anything at all. Once you start analysing your own data you might call your parameters file something more informative, like the name of your target community and some details on the settings.

$ cd ~
$ mkdir MESS
$ cd MESS

# Create a new params file named 'simdata'
$ MESS -n simdata

This will create a file in the current directory called params-simdata.txt. The params file lists, on each line, one parameter value followed by a ## mark, then the index and name of the parameter, and then a short description of its purpose. Let's take a look at it:

$ cat params-simdata.txt
------- MESS params file (v.0.1.1)---------------------------------------------
simdata              ## [0] [simulation_name]: The name of this simulation scenario
./default_MESS       ## [1] [project_dir]: Where to save files
0                    ## [2] [generations]: Duration of simulations. Values/ranges Int for generations, or float [0-1] for lambda.
neutral              ## [3] [community_assembly_model]: Model of Community Assembly: neutral, filtering, competition
point_mutation       ## [4] [speciation_model]: Type of speciation process: none, point_mutation, protracted, random_fission
2.2e-08              ## [5] [mutation_rate]: Mutation rate scaled per base per generation
2000                 ## [6] [alpha]: Abundance/Ne scaling factor
570                  ## [7] [sequence_length]: Length in bases of the sequence to simulate
------- Metacommunity params: --------------------------------------------------
100                  ## [0] [S_m]: Number of species in the regional pool
750000               ## [1] [J_m]: Total # of individuals in the regional pool
2                    ## [2] [speciation_rate]: Speciation rate of metacommunity
0.7                  ## [3] [death_proportion]: Proportion of speciation rate to be extinction rate
2                    ## [4] [trait_rate_meta]: Trait evolution rate parameter for metacommunity
1                    ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
------- LocalCommunity params: island1------------------------------------------
island1              ## [0] [name]: Local community name
1000                 ## [1] [J]: Number of individuals in the local community
0.01                 ## [2] [m]: Migration rate into local community
0                    ## [3] [speciation_prob]: Probability of speciation per timestep in local community
Note: What’s the difference between a CLI argument and a MESS params file parameter, you may be asking yourself? Well, MESS CLI arguments specify how the simulations are performed (e.g. how many to run, how many cores to use, whether to print debugging information, etc), whereas MESS params file parameters dictate the structure of the simulations to run (e.g. sizes of communities, migration rates, speciation rates, etc).

The defaults are all values of moderate size that will generate ‘normal’ looking simulations, and we won’t mess with them for now, but let’s change a couple of parameters to get the hang of it. Why don’t we change the name parameter of the local community; ‘island1’ is so generic! Pick your favorite island and change the name to this. Let’s also set J (size of the local community in individuals) equal to 500, as this will speed up the simulations (smaller local communities reach equilibrium faster).

We will use the nano text editor to modify params-simdata.txt and change this parameter:

$ nano params-simdata.txt

Nano is a command line editor, so you’ll need to use the arrow keys to navigate around the file. Nano accepts a few special keyboard commands for doing things other than modifying text, and it lists these at the bottom of the frame. After you are done making the changes, your file will have lines that look like this:

La_Reunion    ## [0] [name]: Local community name
500           ## [1] [J]: Number of individuals in the local community
Note: For scientific computing, in almost all cases, spaces in variable names and labels should be considered harmful. Notice here how I replace the space in ‘La Reunion’ with an underscore (_) character; this is a common practice that you should adopt.

After you change these parameters you may save and exit nano by typing CTRL+o (to write Output), and then CTRL+x (to eXit the program).

Note: The CTRL+x notation indicates that you should hold down the control key (which is often styled ‘ctrl’ on the keyboard) and then push ‘x’.

Once we start running the simulations and performing MESS analyses, all the temp files and directories needed are created in the project_dir directory and use the prefix specified by the simulation_name parameter. Because we use the default (./default_MESS) for the project_dir for this tutorial, all these intermediate directories will be of the form ~/MESS/default_MESS/simdata_*, or the analogous name that you used for your simulation.

Note on files in the project directory: MESS relies on the integrity of the project_directory for keeping track of various temporary files used by the simulation/analysis process. One result of this is that you can have multiple simulations of the same community assembly scenario using different parameter settings and you don’t have to manage all the files yourself! Another result is that you should not rename or move any of the files or directories inside your project directory, unless you know what you’re doing or you don’t mind if your simulations/analyses break.

Run simulations using your edited params file

Here we will start small and generate 10 simulations using 4 cores, just to get practice running the sims. After this command finishes we’ll have a batch of 10 new summary statistics in our default_MESS directory:

Special Note: In command line mode always specify the number of cores with the -c flag. If you do not, MESS assumes you want only one of them, which will result in painfully slow simulation runs (serial processing).
## -p    the params file we wish to use
## -s    the number of simulations to perform
## -c    the number of cores to allocate   <-- Important!
$ MESS -p params-simdata.txt -s 10 -c 4
 -------------------------------------------------------------
  MESS [v.0.1.1]
  Massive Eco-Evolutionary Synthesis Simulations
 -------------------------------------------------------------
  Project directory exists. Additional simulations will be appended.

    <MESS.Region simdata: ['La_Reunion']>
  establishing parallel connection:
  host compute node: [4 cores] on goatzilla
    Generating 10 simulation(s).
  [####################] 100%  Performing Simulations    | 0:00:46 |
  [####################] 100%
    Finished 10 simulations
Note: You can see here that MESS is intelligently handling all the parallelization work for you. You tell it how many cores to use with the -c flag and it portions out simulations among all the cores as they become available.
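The idea of portioning independent simulations out to a pool of workers can be sketched with Python's standard library. MESS itself uses ipyparallel for this; the toy `simulate` function and the thread pool here are illustrative stand-ins, not MESS code:

```python
# Hedged sketch of how independent simulations can be portioned among a
# pool of workers as they become available. MESS uses ipyparallel for
# this; the toy simulate() function here is purely illustrative.
from concurrent.futures import ThreadPoolExecutor
import random

def simulate(seed):
    # Stand-in for one community assembly simulation: each task is
    # independent and seeded separately, like one row of SIMOUT.txt.
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000))

# 4 workers (like `-c 4`) consume 10 simulation tasks (like `-s 10`).
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(10)))

print(f"Finished {len(results)} simulations")
```

Because each simulation is independent, the scheduler can hand the next task to whichever worker frees up first, which is why runtime scales nearly linearly with core count.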

Inspect the output of the simulation runs

Simulation parameters and summary statistics are written to the SIMOUT.txt file in the project_dir, which is by default created in the current working directory as ./default_MESS. You can check the length of this file:

$ wc -l default_MESS/SIMOUT.txt
11 default_MESS/SIMOUT.txt

## Use `less` to look inside the file. Use `q` to quit less when you are done.
$ less default_MESS/SIMOUT.txt
NB: Lines in this file are very long, so less will wrap the text by default. Turn off line wrapping by typing -S and then pushing enter; this makes the file much easier to read.
S_m     J_m     speciation_rate death_proportion        trait_rate_meta ecological_strength     generations     community_assembly_model
100     750000  2.0     0.7     2.0     1.0     0.0     neutral point_mutation  0.0     2000    570.0   500.0   0.01    0.0     189.0   0.696
100     750000  2.0     0.7     2.0     1.0     0.0     neutral point_mutation  0.0     2000    570.0   500.0   0.01    0.0     43.0    0.238

Setting prior ranges on parameters

Rather than explicitly specifying MESS parameters, let’s say you’re interested in actually estimating parameters from the observed data. We can do this by simulating over a range of values for each parameter of interest, and then using the MESS inference procedure to estimate these parameters. Let’s say you would like to estimate the size of the local community (J) and the migration rate into the local community (m). Edit your params file again with nano:

$ nano params-simdata.txt

and change the following two parameter settings. This time we specify a range of values:

1000-2000           ## [1] [J]: Number of individuals in the local community
0.001-0.01          ## [2] [m]: Migration rate into local community
Note: Saving and quitting from nano: CTRL+o then CTRL+x
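Conceptually, each simulation draws a value uniformly from any range you specify, while fixed values are used as-is. A small sketch of that behavior (the `sample_param` parser here is illustrative, not the actual MESS implementation):

```python
# Sketch of treating a params value like "1000-2000" as a uniform prior
# and "0.01" as a fixed value. Illustrative only, not MESS's parser.
import random

def sample_param(value):
    if "-" in value.lstrip("-"):             # a range like "1000-2000"
        lo, hi = (float(x) for x in value.split("-", 1))
        return random.uniform(lo, hi)
    return float(value)                      # a fixed value like "0.01"

# Each simulation gets its own independent draw from the prior range.
j = sample_param("1000-2000")
m = sample_param("0.001-0.01")
print(j, m)
```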

Now run some more simulations (by default MESS will append these new simulations to the extant SIMOUT.txt file):

$ MESS -p params-simdata.txt -s 10 -c 4
 -------------------------------------------------------------
  MESS [v.0.1.1]
  Massive Eco-Evolutionary Synthesis Simulations
 -------------------------------------------------------------
  Project directory exists. Additional simulations will be appended.

    <MESS.Region simdata: ['La_Reunion']>
  establishing parallel connection:
  host compute node: [4 cores] on goatzilla
    Generating 10 simulation(s).
  [####################] 100%  Performing Simulations    | 0:00:46 |
  [####################] 100%
    Finished 10 simulations
 Clean up ipcluster <ipyparallel.client.client.Client object at 0x7f15cc3c9090>

Let’s use cut to look at just the columns we’re interested in (J and m), which are the 13th and 14th columns.

$ cut -f 13,14 default_MESS/SIMOUT.txt
J       m
500.0   0.01
500.0   0.01
500.0   0.01
500.0   0.01
500.0   0.01
500.0   0.01
500.0   0.01
500.0   0.01
500.0   0.01
500.0   0.01
1118.0  0.00205
1168.0  0.00172
1515.0  0.00323
1061.0  0.0014
1305.0  0.00859
1434.0  0.00881
1397.0  0.00706
1096.0  0.00509
1889.0  0.00112
1699.0  0.00285

And you’ll see that these parameter values are now taking a range, as we specified. Now you are ready to move on to MESS Machine Learning Inference where you will see how we can analyse massive amounts of simulations under varying parameter ranges with a machine learning framework to estimate parameters of the model from real data.

Massive Eco-Evolutionary Synthesis Simulations (MESS) - API Tutorial

This is the second part of the MESS tutorial, in which we introduce the API mode using jupyter notebooks. This is meant as a broad introduction but does assume you’ve already completed the CLI tutorial and are familiar with some of the parameters and terminology. We will use as an example in this tutorial the spider community data set from La Reunion published by Emerson et al (2017). However, you can follow along with one of the other example datasets if you like; the procedure will be identical although your results will vary.

Running and connecting to a Jupyter notebook

For the purposes of this tutorial, all command interactions will take place inside a jupyter notebook running on your personal computer. For the most part we will be writing and executing python commands. Jupyter is already installed as a dependency of MESS, but if you need help getting a server running see the excellent Jupyter notebook documentation.

Create and parameterize a new MESS Region

MESS API mode lets you dive under the hood of the CLI mode a bit. You have all the power of the CLI mode, yet more flexibility. The first step in API mode is to create a MESS Region. A Region encompasses a very large Metacommunity and a much smaller local community which is connected to it by colonization. In creating a Region the only thing you’re required to pass in is a name, so let’s go with “LaReunion” (no spaces!), as this is the region the empirical data is drawn from.

import MESS

reunion = MESS.Region("LaReunion")
print(reunion.get_params())
------- MESS params file (v.0.1.0)----------------------------------------------
LaReunion            ## [0] [simulation_name]: The name of this simulation scenario
./default_MESS       ## [1] [project_dir]: Where to save files
0                    ## [2] [generations]: Duration of simulations. Values/ranges Int for generations, or float [0-1] for lambda.
neutral              ## [3] [community_assembly_model]: Model of Community Assembly: neutral, filtering, competition
point_mutation       ## [4] [speciation_model]: Type of speciation process: none, point_mutation, protracted, random_fission
2.2e-08              ## [5] [mutation_rate]: Mutation rate scaled per base per generation
2000                 ## [6] [alpha]: Abundance/Ne scaling factor
570                  ## [7] [sequence_length]: Length in bases of the sequence to simulate
------- Metacommunity params: --------------------------------------------------
100                  ## [0] [S_m]: Number of species in the regional pool
750000               ## [1] [J_m]: Total # of individuals in the regional pool
2                    ## [2] [speciation_rate]: Speciation rate of metacommunity
0.7                  ## [3] [death_proportion]: Proportion of speciation rate to be extinction rate
2                    ## [4] [trait_rate_meta]: Trait evolution rate parameter for metacommunity
1                    ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
------- LocalCommunity params: Loc1---------------------------------------------
Loc1                 ## [0] [name]: Local community name
1000                 ## [1] [J]: Number of individuals in the local community
0.01                 ## [2] [m]: Migration rate into local community
0                    ## [3] [speciation_prob]: Probability of speciation per timestep in local community

These are all the parameters of the model. The defaults are chosen to reflect a typical oceanic island arthropod community. Don’t worry at this point about all the parameters; let’s focus for now on community_assembly_model, the size of the local community (J), and the rate of migration from the metacommunity to the local community (m). We will set parameter ranges for these, and each simulation will sample a random value from this range. In a new cell use the set_param() method to change these values:

reunion.set_param("community_assembly_model", "*")
reunion.set_param("J", "1000-10000")
reunion.set_param("m", "0.001-0.01")
NB: Setting the community_assembly_model to * indicates that we want to sample uniformly among all three of the model types: neutral, competition, and environmental filtering.

Print the params again to prove to yourself that the ranges are now set:

print(reunion.get_params())
------- MESS params file (v.0.1.0)----------------------------------------------
LaReunion            ## [0] [simulation_name]: The name of this simulation scenario
./default_MESS       ## [1] [project_dir]: Where to save files
0                    ## [2] [generations]: Duration of simulations. Values/ranges Int for generations, or float [0-1] for lambda.
*                    ## [3] [community_assembly_model]: Model of Community Assembly: neutral, filtering, competition
point_mutation       ## [4] [speciation_model]: Type of speciation process: none, point_mutation, protracted, random_fission
2.2e-08              ## [5] [mutation_rate]: Mutation rate scaled per base per generation
2000                 ## [6] [alpha]: Abundance/Ne scaling factor
570                  ## [7] [sequence_length]: Length in bases of the sequence to simulate
------- Metacommunity params: --------------------------------------------------
100                  ## [0] [S_m]: Number of species in the regional pool
750000               ## [1] [J_m]: Total # of individuals in the regional pool
2                    ## [2] [speciation_rate]: Speciation rate of metacommunity
0.7                  ## [3] [death_proportion]: Proportion of speciation rate to be extinction rate
2                    ## [4] [trait_rate_meta]: Trait evolution rate parameter for metacommunity
1                    ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
------- LocalCommunity params: Loc1---------------------------------------------
Loc1                 ## [0] [name]: Local community name
1000-10000           ## [1] [J]: Number of individuals in the local community
0.001-0.01           ## [2] [m]: Migration rate into local community
0                    ## [3] [speciation_prob]: Probability of speciation per timestep in local community

Run MESS serial simulations in API mode

Now we can run community assembly simulations given our new parameterization using the run() method. There is one required argument to this method (sims), which indicates the number of independent community assembly realizations to perform.

reunion.run(sims=1)
   Generating 1 simulation(s).
[####################] 100%  Finished 0 simulations    | 0:01:02 |

Run MESS parallel simulations in API mode

Like the CLI, the MESS API can make use of all the cores you can throw at it, thanks to integration with the very nice ipyparallel library. Take a moment to launch an ipcluster instance.

Now we assume you have an ipyclient object initialized in your notebook. The Region.run() method can also accept an optional argument (ipyclient) for specifying a connection to an ipyparallel backend, allowing for massive parallelization. Let’s check how many cores our ipyclient is attached to:

len(ipyclient)
40

Now call run and generate 40 simulations on the 40 cores:

reunion.run(sims=40, ipyclient=ipyclient)
  Generating 40 simulation(s).
[####################] 100%  Performing Simulations    | 0:01:31 |
[####################] 100%
  Finished 40 simulations

Now we have generated 40 simulations in parallel in roughly the same time it took to generate 1 simulation in serial. I say ‘roughly’ here for two reasons. First, the simulations are stochastic, and the amount of time any given simulation will take is Poisson distributed, so sometimes you’ll get ‘unlucky’ with one that takes much longer. Second, by default the generations parameter is 0, which indicates to uniformly sample a lambda value from the half-open interval [0-1). Small lambda values will (on average) run faster than large ones, so this is another source of variability in runtime.

Adding more simulations to a region

As with the CLI mode, if you find you need to add more simulations to a Region, for whatever reason, you can simply call run() again, and this will append the new simulations to what has already been run.

Using simulations to perform statistical inference

You can now proceed to the MESS Machine Learning Tutorial to learn how to use the simulations to perform model selection and parameter estimation on real data.

References

Emerson, B. C., Casquet, J., López, H., Cardoso, P., Borges, P. A.,
    Mollaret, N., … & Thébaud, C. (2017). A combined field survey and
    molecular identification protocol for comparing forest arthropod
    biodiversity across spatial scales. Molecular ecology resources, 17(4),
    694-707.

MESS Machine Learning (ML) Inference

Once you have performed a sufficient number of simulations you can now use these simulations to perform community assembly model selection and parameter estimation for observed community data. In this short tutorial we will demonstrate the fundamentals of how this works.

The MESS.inference architecture is based on the powerful and extensive scikit-learn python machine learning library. Another really amazing resource I highly recommend is the Python Data Science Handbook.

Download and examine example empirical data

We will use as an example a dataset of community-scale COI sequences (~500 bp) and densely sampled abundances for the spider community on the island of La Reunion, an overseas department of France and the largest of the Mascarene islands, located in the Indian Ocean approximately 1000 km from Madagascar. The data we will be using was collected and published by Emerson et al (2017). For this exercise, we will use the pre-processed summary statistics which were generated from the real data and which are available in the MESS github repository. For further instruction on properly formatting and converting raw data into MESS-ready format see the MESS raw data handling page.

In a new cell in your notebook you can download the Reunion spider data like this:

!wget https://raw.githubusercontent.com/messDiv/MESS/master/empirical_data/Reunion_spiders/spider.dat
NB: The ! prior to the command here indicates that the jupyter notebook should interpret this as a bash command executed at the command line. This is a handy way of executing bash commands from within the notebook environment, rather than returning to a terminal window on the cluster. It’s just a nice shortcut.

Now make a new cell and import MESS and pandas (which is a python structured data library providing the DataFrame class), and read in the data you just downloaded.

%matplotlib inline
import MESS
import pandas as pd
spider_df = pd.read_csv("spider.dat", index_col=0)
spider_df[:5]

Special Note: The %matplotlib inline command tells jupyter to draw the figures actually inside your notebook environment. If you don’t include this your figures won’t show up!

NB: Importing pandas as pd is pretty canonical. It makes typing out pandas commands shorter because you can reference it as pd rather than pandas.

NB: The final line in the above command asks python to display the first 5 rows of the spider_df dataframe. It should look like this:

_images/Reunion_spider_df.png

Fetch pre-baked simulations

Since it can take quite some time to run a number of simulations sufficient for model selection and parameter estimation we will use a suite of pre-baked simulations generated ahead of time. This dataset is composed of ~8k simulations under each of the community assembly models with a few of the parameters sampled from distributions and the rest set to fixed values. Fetch this dataset from the compphylo site with wget:

!wget https://compphylo.github.io/Oslo2019/MESS_files/MESS_simulations/SIMOUT.txt
!wc -l SIMOUT.txt
100%[======================================>] 14,234,338  50.0M/s   in 0.3s
2019-08-11 01:25:27 (50.0 MB/s) - "SIMOUT.txt" saved [14234338/14234338]
24440 SIMOUT.txt
NB: The wc command counts the number of lines if you pass it the -l flag. You can see this series of ~25k simulations is about 14MB.

Supported ensemble methods

MESS currently supports three different scikit-learn ensemble methods for classification and parameter estimation:

  • RandomForest (rf)
  • GradientBoosting (gb)
  • AdaBoost (ab)

ML assembly model classification

The first step is now to assess the model of community assembly that best fits the data. The three models are neutral, in which all individuals are ecologically equivalent; competition, in which species have traits, and differential survival probability is weighted by distance of traits from the trait mean in the local community (closer to the trait mean == higher probability of death); and filtering, in which there is an environmental optimum, and proximity of trait values to this optimum is positively correlated with survival probability.

Basically we want to know: are individuals in the local community ecologically equivalent, and if not, are they interacting more with each other or more with the local environment?

Here we create a MESS.inference.Classifier object and pass in the empirical dataframe, the simulations, and specify RandomForest (rf) as the algorithm to use (other options are GradientBoosting (gb) and AdaBoost (ab)). Full documentation of all the machine learning inference architecture is available in the MESS API specification.

cla = MESS.inference.Classifier(empirical_df=spider_df, simfile="SIMOUT.txt", algorithm="rf")
est, proba = cla.predict(select_features=False, param_search=False, quick=True, verbose=True)

The Classifier.predict() method provides a sophisticated suite of routines for automating machine learning performance tuning. Here we disable all of this so it runs quickly, trading accuracy for speed. The select_features argument controls whether or not to prune summary statistics based on correlation and informativeness. The param_search argument controls whether or not to exhaustively search the space of ML hyperparameters to maximize accuracy of the selected algorithm. Finally, the quick argument subsets the simulation database to 1000 simulations, again as a means of running quickly to verify, for example, that all the input data and simulations are formatted properly. When you want to investigate real data you should use select_features=True, param_search=True, quick=False. This will result in much more accurate inference, but will take much longer to run.

The call to cla.predict() returns both the most likely community assembly model (est) and the model probabilities (proba), which you can view:

from IPython.display import display
display(est, proba)
NB: The first line of this codeblock imports the display method from the IPython library, which makes pd.DataFrames display nicely.
_images/MESSClassifier_results.png

Here we see that the neutral model is vastly favored over the competition and filtering models.

Now we can examine the weight of importance of each feature, an indication of the contribution of each summary statistic to the inference of the ML algorithm. You can also get a quick and dirty plot of feature importances directly from the Classifier object with the plot_feature_importance() method.

display(cla.feature_importances())
cla.plot_feature_importance(figsize=(5,5))
NB: The figsize argument controls the size of the output figure. The default is (10, 10), so here we just make it a bit smaller to fit better in the notebook window.
_images/MESSClassifier_feature_importance.png

The plot_feature_importance() method prunes out all the summary statistics that contribute less than 5% of information to the final model. This method is simply for visualization. Here you can see that the shape of the abundance distributions (as quantified by the first 4 Hill numbers) and the standard deviation of nucleotide diversity (pi) within the local community are the most important summary statistics for differentiating between the different community assembly models.
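If you are curious how Hill numbers themselves are computed from an abundance distribution, here is a minimal sketch using the standard formula (for illustration only; MESS computes these summary statistics internally):

```python
import numpy as np

def hill_number(abundances, q):
    """Hill number of order q for a vector of species abundances."""
    p = np.asarray(abundances, dtype=float)
    p = p / p.sum()
    if q == 1:
        # The limit as q -> 1 is the exponential of Shannon entropy
        return float(np.exp(-np.sum(p * np.log(p))))
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

# For perfectly even abundances every Hill number equals species richness
print([round(hill_number([10, 10, 10, 10], q), 2) for q in range(1, 5)])  # [4.0, 4.0, 4.0, 4.0]
```

Higher orders of q weight abundant species more heavily, which is why the first few Hill numbers together capture the shape of the abundance distribution.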

ML parameter estimation

Now that we have identified the neutral model as the most probable, we can estimate parameters of the empirical data given this model. Essentially, we are asking the question “What are the parameters of the model that generate summary statistics most similar to those of the empirical data?”

The MESS.inference module provides the Regressor class, which has a very similar API to the Classifier. We create a Regressor and pass in the empirical data, the simulations, and the machine learning algorithm to use, but this time we also add the target_model which prunes the simulations to include only those of the community assembly model of interest. Again, we call predict() on the regressor and pass in all the arguments to make it run fast, but do a bad job.

rgr = MESS.inference.Regressor(empirical_df=spider_df, simfile="SIMOUT.txt", target_model="neutral", algorithm="rfq")
est = rgr.predict(select_features=False, param_search=False, quick=True, verbose=False)
NB: Note that here we ask for the rfq algorithm, which designates random forest quantile regression, and allows for constructing prediction intervals, something the stock rf algorithm doesn’t do. The gradient boosting method (gb) provides prediction intervals natively.

The return value est is a dataframe containing predicted values for each model parameter along with 95% prediction intervals, which are conceptually in the same ballpark as maximum likelihood confidence intervals (CI) and Bayesian highest posterior densities (HPD).

display(est)
_images/MESSRegressor_parameter_estimates.png

Similar to the classifier, you can also extract feature importances from the regressor, to get an idea about how each feature is contributing to parameter estimation. In this case, the feature importance plots are a little more useful than the giant table. Also note that we’re requesting a slightly larger figure, because this time there will be much more information.

display(rgr.feature_importances())
rgr.plot_feature_importance(figsize=(20,8))
_images/MESSRegressor_feature_importance.png
_images/MESSRegressor_plot_feature_importance.png

You can see that some parameters are strongly driven by one or a couple of features, and for some the feature importances are more diffuse. For example species richness (S) contributes overwhelmingly to estimation of local community size (J), which makes sense. On the other hand, many more factors seem to contribute equally to estimation of migration rate (m).

Perform posterior predictive simulations

Finally, a very important step in machine learning inference is to validate the parameter estimates by performing posterior predictive checks (PPCs). The logic of this procedure is that we will generate a suite of simulations using our most probable parameter estimates, and compare these simulations to our empirical data. For our purposes here we will project the summary statistics of observed and simulated datasets into principal component space. If the parameter estimates are a good fit to the data then our posterior predictive simulations will cluster together with the real data, whereas if the parameter estimates are a poor fit, then the real data will be quite different from the simulations.

Spoiler alert: Posterior predictive simulations can take quite a while to run. Even with the parameters estimated from this toy data, a reasonable number of sims could take a couple of hours. So! For the purpose of this exercise we are going to transform the estimated parameters to make the simulations run faster. DO NOT DO THIS WITH YOUR REAL DATA!

est["J"] /= 2
est["m"] *= 10
est["_lambda"] /= 2
est

Here we reduced the size of the local community (J), increased the rate of migration (m), and decreased the duration of the simulation (_lambda). All these things will make the simulations run more quickly. Now go ahead and run this next cell, it should take one to two minutes.

MESS.inference.posterior_predictive_check(empirical_df=spider_df,
                                          parameter_estimates=est,
                                          est_only=True,
                                          nsims=20,
                                          verbose=True,
                                          force=True)
  Generating 20 simulation(s).
[####################] 100%  Finished 19 simulations   | 0:02:57 |

  Calculating PCs and plotting
NB: The posterior_predictive_check() function can also accept an ipyclient parameter to specify to use an ipyparallel backend. For simplicity here we will not use this option, but in reality it would be a good idea to parallelize the posterior predictive simulations. See the MESS parallelization docs for more info about how to implement the parallel backend.

This time the only things we have to pass in are the empirical data and the dataframe of predicted values, but we show a couple more arguments here for the sake of completeness. Setting est_only to True will use the exact parameter estimates for all simulations, whereas setting it to False will sample parameters uniformly between the upper and lower prediction interval for each simulation. nsims is the number of posterior predictive simulations to perform. force will overwrite any pre-existing simulations, preventing you from mixing apples and oranges from different simulation runs. On the other hand, if force is False (the default), this is a handy way to add more PPCs to your analysis.
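To give a sense of what est_only=False does, here is a toy sketch of sampling parameter sets uniformly within prediction intervals (the parameter names and interval bounds below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters with (lower, upper) prediction interval bounds
bounds = {"J": (1000.0, 3000.0), "m": (0.001, 0.01)}

def sample_within_intervals(bounds, rng):
    """Draw one parameter set uniformly between the prediction bounds."""
    return {name: float(rng.uniform(lo, hi)) for name, (lo, hi) in bounds.items()}

# One parameter set per posterior predictive simulation
draws = [sample_within_intervals(bounds, rng) for _ in range(20)]
print(draws[0])
```

Sampling across the interval in this way propagates the uncertainty in the parameter estimates into the posterior predictive simulations, rather than conditioning on a single point estimate.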

The posterior_predictive_check() method will also display the summary statistics for the simulated and observed data projected into PC space. The observed data is colored in red. Here is a pretty typical example of a good fit of parameters to the data:

_images/MESS_PPC_GoodFit.png

And here is a nice example of a poor fit of parameters to the data:

_images/MESS_PPC_BadFit.png

Quite a striking and obvious difference.

Experiment with other example datasets

Now you may wish to experiment with some of the other empirical datasets provided on the MESS github repo, or to explore the substantial MESS inference procedure API documentation, where we show plenty of other cool things MESS is capable of that we don’t have time to cover in this short tutorial.

References

Emerson, B. C., Casquet, J., López, H., Cardoso, P., Borges, P. A.,
    Mollaret, N., … & Thébaud, C. (2017). A combined field survey and
    molecular identification protocol for comparing forest arthropod
    biodiversity across spatial scales. Molecular ecology resources, 17(4),
    694-707.
MESS Parallelization Docs

Either run this by hand on a compute node or add this line to your jupyter notebook job submission script before the call to launch the notebook server:

ipcluster start -n 40 --cluster-id=MESS --daemonize

Here -n 40 specifies the number of cores to use; change this to match the number of cores you request in the job script.

Now inside the notebook you want to run simulations in, attach to the running ipyparallel instance like this:

import ipyparallel as ipp

try:
    ipyclient = ipp.Client(cluster_id="MESS")
    print(len(ipyclient))
except Exception as e:
    print("Could not connect to the MESS ipcluster: {}".format(e))

And now you can use the parallel backend when running simulations inside a jupyter notebook:

import MESS
r = MESS.Region("Watdo")
r.run(nsims=1000, ipyclient=ipyclient)

In this miniature tutorial we will guide you through the steps of converting raw data into properly formatted MESS-friendly dataframes.

Put stuff here

MESS Model Parameters

The parameters contained in a params file affect the behavior of various parts of the forward-time and backward-time assembly process. The defaults we chose are fairly reasonable values as a starting point; however, you will always need to modify at least a few of them (for example, to indicate the location of your data), and often you will want to modify many of the parameters.

Below is an explanation of each parameter setting, the eco-evolutionary process that it affects, and example entries for the parameter into a params.txt file.

NB: Most parameters for which it is sensible to do so will accept a prior range which will be sampled from uniformly for each simulation (e.g. a value of 1000-5000 for parameter J will choose a new J for each simulation inclusively bounded by these values).
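As an illustration of how such a prior range entry behaves, here is a sketch of parsing and sampling it (this is not MESS's actual parsing code, just the idea):

```python
import random

def sample_param(entry):
    """Interpret a params.txt value: a fixed number, or a 'lo-hi' uniform range."""
    text = str(entry).strip()
    try:
        return float(text)               # a single fixed value (handles "2.2e-08" too)
    except ValueError:
        lo, hi = (float(x) for x in text.split("-", 1))
        return random.uniform(lo, hi)    # inclusively bounded draw

print(sample_param("2000"))        # always 2000.0
print(sample_param("1000-5000"))   # a new draw each call
```

Fixed entries yield the same value for every simulation, while range entries are re-drawn per simulation, which is how MESS explores parameter space when generating training simulations.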

simulation_name

The simulation name is used as the prefix for all output files. It should be a unique identifier for this particular set of simulations, meaning the set of parameters you are using for the current data set. When I run multiple related simulations I usually use names indicating the specific parameter combinations (e.g., filtering_nospeciation, J5000_neutral).

Example: New community simulations are created with the -n option to MESS:

## create a new assembly named J1000_neutral
$ MESS -n J1000_neutral

project_dir

A project directory can be used to group together multiple related simulations. A new directory will be created at the given path if it does not already exist. A good name for project_dir will generally be the name of the community/system being studied. The project dir path should generally not be changed after simulations/analysis are initiated, unless the entire directory is moved to a different location/machine.

Example entries into params.txt:

/home/watdo/MESS/galapagos         ## [1] create/use project dir called galapagos
galapagos                          ## [1] create/use project dir called galapagos in cwd (./)

generations

This parameter specifies the amount of time to run forward-time simulations. It can be specified in a number of different ways, but overall time can be considered either in terms of Wright-Fisher (WF) generations or in terms of Lambda. For WF generations you should specify an integer value (or a range of integer values) which will run the forward-time process for WF * J / 2 time-steps (where a time-step is one birth/death/colonization/speciation event). For Lambda you may select either an exact Lambda value (a real value between 0 and 1 exclusive), or you can set generations equal to 0, which will draw random Lambda values between 0 and 1 for each simulation.

Example entries into params.txt:

0                   ## [2] [generations]: Sample random Lambda values for each simulation
100                 ## [2] [generations]: Run each simulation for 100 WF generations
50-100              ## [2] [generations]: Sample uniform between 50-100 WF generations for each simulation
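The conversion above from WF generations to forward-time steps can be sketched as:

```python
def wf_generations_to_timesteps(wf_gens, J):
    """Each WF generation corresponds to J/2 birth/death/colonization/speciation events."""
    return wf_gens * J // 2

# 100 WF generations in a local community of J=1000 individuals
print(wf_generations_to_timesteps(100, 1000))  # 50000 time-steps
```

This is why simulation run time scales with both the generations setting and the local community size J.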

community_assembly_model

With this parameter you may specify a neutral or non-neutral scenario for the forward time process. There are currently three different options for this parameter: neutral, filtering, or competition. The neutral case indicates full ecological equivalence of all species, so all individuals have an equal probability of death at each time-step. In the filtering and competition models survival probability is contingent on proximity of species trait values to the environmental optimum, or distance from the local trait mean, respectively. You may also use the wildcard * here and MESS will randomly sample one community assembly model for each simulation.

Example entries into params.txt:

neutral             ## [3] [community_assembly_model]: Select the neutral process forward-time
filtering           ## [3] [community_assembly_model]: Select the environmental filtering process
*                   ## [3] [community_assembly_model]: Randomly choose one of the community assembly models

speciation_model

Specify a speciation process in the local community. If none then no speciation happens locally. If point_mutation then one individual will instantaneously speciate at rate speciation_prob at each forward-time step. If random_fission then one lineage will randomly split into two lineages at rate speciation_prob, with the new lineage receiving n ~ U(1, local species abundance) individuals and the parent lineage retaining the remaining abundance - n individuals. protracted will specify a model of protracted speciation, but this is as yet unimplemented.

Example entries into params.txt:

none                ## [4] [speciation_model]: No speciation in the local community
point_mutation      ## [4] [speciation_model]: Point mutation speciation process
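The random_fission split can be sketched as follows (a toy illustration, not MESS's internal code; note that if the new lineage happens to receive all individuals, the parent lineage goes locally extinct):

```python
import random

def random_fission(abundance, rng=random):
    """Split a lineage: the new species gets n ~ U(1, abundance) individuals,
    and the parent retains the remainder."""
    n = rng.randint(1, abundance)      # inclusive on both ends
    return n, abundance - n

new_sp, parent = random_fission(50)
print(new_sp, parent)  # the two abundances always sum to 50
```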

mutation_rate

Specify the mutation rate for backward-time coalescent simulation of genetic variation. This rate is the per base, per generation probability of a mutation under an infinite sites model.

Example entries into params.txt:

2.2e-08             ## [5] [mutation_rate]: Mutation rate scaled per base per generation

alpha

Scaling factor for transforming number of demes to number of individuals. alpha can be specified as either a single integer value or as a range of values.

Example entries to params.txt file:

2000                ## [6] [alpha]: Abundance/Ne scaling factor
1000-10000          ## [6] [alpha]: Abundance/Ne scaling factor
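To see how alpha interacts with the mutation rate, here is a back-of-the-envelope sketch using the textbook neutral expectation for per-site nucleotide diversity, E[pi] = 4 * Ne * mu (whether MESS uses this exact scaling internally is not specified here; treat this as a sanity check, not MESS's implementation):

```python
def effective_population_size(abundance, alpha):
    """Scale a local abundance (number of demes) to individuals using alpha."""
    return abundance * alpha

def expected_pi(abundance, alpha, mu):
    """Textbook neutral expectation for per-site diversity: 4 * Ne * mu."""
    return 4 * effective_population_size(abundance, alpha) * mu

# A species with local abundance 50, alpha=2000, and mu=2.2e-08 per base
print(expected_pi(abundance=50, alpha=2000, mu=2.2e-08))
```

This kind of calculation is useful for checking whether your alpha and mutation_rate settings produce diversity values in the same ballpark as your empirical data.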

sequence_length

Length of the sequence to simulate in the backward-time process under an infinite sites model. This value should be specified based on the length of the region sequenced for the observed community data in bp.

Example entries to params.txt file:

570                 ## [7] [sequence_length]: Length in bases of the sequence to simulate

S_m

S_m specifies the total number of species to simulate in the metacommunity. Larger values will result in more singletons in the local community and reduced rates of multiple-colonization.

Example entries to params.txt file:

500                 ## [0] [S_m]: Number of species in the regional pool
100-1000            ## [0] [S_m]: Number of species in the regional pool

J_m

The total number of individuals in the metacommunity. This value is divided up among all S_m species following a logseries distribution. J_m values which are smaller will produce more even abundances in the metacommunity, which will result in more equal probability of colonization among species.

Example entries to params.txt:

50000               ## [1] [J_m]: Total # of individuals in the regional pool
50000-500000        ## [1] [J_m]: Total # of individuals in the regional pool
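As a rough illustration of dividing J_m among S_m species with a logseries shape, here is a sketch using numpy's logseries sampler (the shape parameter p and the rescaling scheme are assumptions for illustration; MESS's exact procedure may differ):

```python
import numpy as np

rng = np.random.default_rng(1)

def metacommunity_abundances(S_m, J_m, p=0.98):
    """Draw S_m abundances with a logseries shape, rescaled to roughly J_m total."""
    raw = rng.logseries(p, S_m).astype(float)
    # Rescale to approximately J_m individuals, keeping every species at
    # abundance >= 1; rounding means the total is only approximately J_m
    return np.maximum(1, np.round(raw / raw.sum() * J_m)).astype(int)

abunds = metacommunity_abundances(S_m=500, J_m=50000)
print(len(abunds), abunds.sum())
```

A logseries metacommunity is dominated by a few very abundant species, which is why smaller J_m values (flatter distributions) equalize colonization probabilities among species.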

speciation_rate

The speciation rate parameter of the birth-death process which generates the metacommunity phylogeny.

Example entries to params.txt:

2                  ## [2] [speciation_rate]: Speciation rate of metacommunity

death_proportion

The rate of extinction proportional to the speciation rate for the birth-death metacommunity phylogeny.

Example entries to params.txt:

0.7                ## [3] [death_proportion]: Proportion of speciation rate to be extinction rate

trait_rate_meta

The variance of the Brownian motion process for evolving traits along the metacommunity phylogeny. Smaller values will result in more trait similarity within clades, and larger values will result in more overdispersed traits.

Example entries to params.txt:

2                  ## [4] [trait_rate_meta]: Trait evolution rate parameter for metacommunity

ecological_strength

This parameter dictates the strength of interactions in the environmental filtering and competition models. As the value of this parameter approaches zero, ecological strength is reduced and the assembly process increasingly resembles neutrality (ecological equivalence). Larger values increasingly bias probability of death against individuals with traits farther from the environmental optimum (in the filtering model).

In the following examples the environmental optimum is 3.850979, and the ecological strength is varied from 0.001 to 100. Column 0 is species ID, column 1 is trait value, column 2 is unscaled probability of death, and column 3 is proportional probability of death. Models with strength of 0.001 and 0.01 are essentially neutral. Strength of 0.1 confers a slight advantage to individuals very close to the local optimum (e.g. species ‘t97’).

_images/ecological_strength_0.001.png _images/ecological_strength_0.01.png _images/ecological_strength_0.1.png

Ecological strength of 1 (below, left panel) is noticeably non-neutral (e.g. ‘t97’ survival probability is 10x greater than average). A value of 10 for this parameter generates a _strong_ non-neutral process (below, center panel: ‘t97’ is 100x less likely to die than average, and the distribution of death probabilities is more varied). Ecological strength values >> 10 are _extreme_ and will probably result in degenerate behavior (e.g. strength of 100 (below, right panel) in which several of the species will be effectively immortal, with survival probability thousands of times better than average).

_images/ecological_strength_1.png _images/ecological_strength_10.png _images/ecological_strength_100.png

Example entries to params.txt:

1                  ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
0.001-1            ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
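The qualitative behavior described above can be reproduced with a toy death-weight function. The functional form below, 1 - exp(-squared trait distance / strength), is one plausible choice that matches the described behavior (small strength gives near-equal death weights, i.e. neutrality, while large strength makes near-optimum individuals nearly immortal); it is not necessarily MESS's exact equation:

```python
import numpy as np

def death_weights(traits, optimum, strength):
    """Unscaled death weights under filtering: 1 - exp(-squared distance / strength).

    As strength -> 0 all weights approach 1 (ecological equivalence); as
    strength grows, individuals near the optimum become nearly immortal."""
    d2 = (np.asarray(traits, dtype=float) - optimum) ** 2
    return 1.0 - np.exp(-d2 / strength)

traits = [3.55, 3.0, 5.0, 1.0]   # the first trait is closest to the optimum
for s in (0.001, 1, 100):
    w = death_weights(traits, optimum=3.850979, strength=s)
    print(s, np.round(w / w.sum(), 4))   # proportional death probabilities
```

At strength 0.001 the proportional probabilities come out essentially equal, while at larger strengths the near-optimum individual's death probability collapses toward zero, mirroring the panels above.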

name

This is a label for the local community and is used to generate output file names. Generally not that useful unless you’re simulating multiple local communities.

Example entries to params.txt:

island1            ## [0] [name]: Local community name

J

The number of individuals in the local community. This value is fixed for the duration of a community assembly simulation. The size of J will have a strong impact on the consequent species richness and abundance distribution.

Example entries to params.txt:

1000-2000          ## [1] [J]: Number of individuals in the local community

m

The probability at any given timestep that a randomly sampled individual in the local community will be replaced by a migrant from the metacommunity. Larger values of m will generally result in increased species richness.

Example entries to params.txt:

0.01               ## [2] [m]: Migration rate into local community

speciation_prob

Given that a randomly sampled individual from the local community is being replaced by the offspring of another individual from the local community, this parameter defines the probability that this offspring individual will undergo a speciation process.

Example entries to params.txt:

0                  ## [3] [speciation_prob]: Probability of speciation per timestep in local community
0.0001-0.001       ## [3] [speciation_prob]: Probability of speciation per timestep in local community

API Documentation

This is the API documentation for MESS, and provides detailed information on the Python programming interface. See the Massive Eco-Evolutionary Synthesis Simulations (MESS) - API Tutorial for an introduction to using this API to run simulations.

Simulation model

Region
Metacommunity
Local Community

Inference Procedure

Model Selection (Classification)
Parameter Estimation (Regression)
Classification Cross-Validation
Parameter Estimation Cross-Validation
Posterior Predictive Checks

Stats

Plotting