Massive Eco-evolutionary Synthesis Simulations (MESS)¶
MESS is a novel comparative phylogeographic model grounded in community ecological theory. This integrative approach makes use of four data axes (distributions of traits, abundances, genetic diversities/divergences, and phylogenetic patterns) to enable testing alternative community assembly models (neutral vs non-neutral) and estimating parameters underlying different assembly processes (e.g. dispersal vs in situ speciation). This method capitalizes on the widespread use of DNA barcoding and meta-barcoding approaches.
What kind of data it requires¶
MESS requires sequence data from population-level sampling (5-10 individuals per species) from one or multiple ecological communities of organisms. This can be at a variety of scales, ranging from a microbial community within a host individual, to a locally sampled plot targeting everything from a taxonomic group, to a regional assemblage that emerged via dispersal and/or local speciation. Currently only single-locus data is supported, so community metabarcoding projects are quite appropriate. Other data types can be included but are not required (abundances, per-taxon trait metrics, and phylogenetic information).
Try it now!¶
Experiment with the MESS model now! Launch the binder instance below and you can open and run the notebooks in the jupyter-notebooks directory.
References¶
The MESS manuscript is available on bioRxiv:
Overcast I, Ruffley M, Rosindell J, Harmon L, Borges PA, Emerson BC, ... &
Rominger A. (2020). A unified model of species abundance, genetic
diversity, and functional diversity reveals the mechanisms structuring
ecological communities. BioRxiv.
MESS is based on previous work on the gimmeSAD joint neutral model and the CAMI trait-mediated community assembly model which can be found here:
Overcast I, Emerson BC, Hickerson MJ. (2019). An integrated model of
population genetics and community ecology. Journal of Biogeography,
46: 816-829. https://doi.org/10.1111/jbi.13541
Ruffley M, Peterson K, Week B, Tank DC, & Harmon LJ. (2019). Identifying
models of trait‐mediated community assembly using random forests and
approximate Bayesian computation. Ecology and Evolution, 9(23), 13218-
13230.
Installation¶
MESS requires Python >= 3.8. Binaries can be downloaded with conda (recommended) or built with pip.
Method 1: Install pre-built binary from conda¶
1. Download and install anaconda or miniconda.
2. (Optional) create a separate conda environment to install into:
conda create -n mess-env python=3.8
conda activate mess-env
3. Install:
conda install -c mess -c conda-forge mess
See venv to install into a virtual environment.
Massive Eco-Evolutionary Synthesis Simulations (MESS) - CLI Tutorial¶
This is the first part of the full tutorial for the command line interface (CLI) for MESS. In this tutorial we’ll walk through the basics of the process of generating simulations. This is meant as a broad introduction to familiarize users with the general workflow, and some of the parameters and terminology. We will use as a template example the spider community dataset from La Reunion published by Emerson et al. (2017). However, you can follow along with one of the other example datasets if you like; the procedure will be identical although your results will vary.
- Quickstart guide for the impatient
- Equilibrium theory of island biogeography and ecological neutral theory
- Overview of MESS simulation and analysis workflow
- Installation
- Getting started with the MESS CLI
- Create and edit a new parameters file
- Run simulations using your edited params file
- Inspect the output of the simulation runs
- Setting prior ranges on parameters
Each grey cell in this tutorial indicates a command line interaction.
Lines starting with `$` indicate a command that should be executed in a terminal connected to the cluster, for example by copying and pasting the text into your terminal. Elements in code cells surrounded by angle brackets (e.g. `<mystuffs>`) are variables that need to be replaced by the user. All lines in code cells beginning with ## are comments and should not be copied and executed. All other lines should be interpreted as output from the issued commands.
## Example Code Cell.
## Create an empty file in my home directory called `watdo.txt`
$ touch ~/watdo.txt
## Print "wat" to the screen
$ echo "wat"
wat
Quickstart guide for the impatient¶
TL;DR: Just show me how to do the simulations! Say you’re impatient and want to skip right to the good stuff, well here you go.
## Create a parameters file
$ MESS -n simdata
## Do 10 simulations using the default settings and 4 cores
$ MESS -p params-simdata.txt -s 10 -c 4
-------------------------------------------------------------
MESS [v.0.1.1]
Massive Eco-Evolutionary Synthesis Simulations
-------------------------------------------------------------
Project directory exists. Additional simulations will be appended.
<MESS.Region simdata: ['island1']>
establishing parallel connection:
host compute node: [4 cores] on goatzilla
Generating 10 simulation(s).
[####################] 100% Performing Simulations | 0:00:28 |
[####################] 100%
Finished 10 simulations
BAM!
Equilibrium theory of island biogeography and ecological neutral theory¶
Before we begin, if you are unfamiliar with community ecology theory it might be useful to review a brief google presentation given at a recent workshop, which provides background and a brief introduction to the MESS model:

A classic individual based birth/death/colonization model of community assembly. For each timestep one individual is randomly sampled to ‘die’ and is replaced with some small probability by a colonizer from the regional pool, and otherwise is replaced by the offspring of a random individual sampled from the local community.
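The birth/death/colonization loop described above can be sketched in a few lines of Python. This is a toy illustration of the neutral process only, not the MESS implementation; all names here (`neutral_timestep`, the species labels) are hypothetical:

```python
import random

def neutral_timestep(local, regional_pool, m=0.01):
    """One death/replacement event in a zero-sum neutral local community.

    `local` is a list of species IDs (one entry per individual),
    `regional_pool` is a list of species IDs to draw colonists from,
    and `m` is the per-event colonization probability.
    """
    # One randomly chosen individual dies
    victim = random.randrange(len(local))
    if random.random() < m:
        # Replaced by a colonist from the regional pool
        local[victim] = random.choice(regional_pool)
    else:
        # Replaced by the offspring of a random local individual
        local[victim] = random.choice(local)
    return local

community = ["sp1"] * 5 + ["sp2"] * 5
for _ in range(1000):
    neutral_timestep(community, ["sp1", "sp2", "sp3"], m=0.01)
```

Because the community is zero-sum, its size never changes; only the identities of the individuals drift over time.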
Overview of MESS simulation and analysis workflow¶
The basic steps of this process are as follows. In this intro tutorial we will only cover steps 1 and 2, but after you are comfortable with generating simulations you may proceed to the ML inference tutorial to learn about the remaining steps:
- Step 1 - Set parameters based on prior knowledge of empirical system
- Step 2 - Run mega simulations
- Step 3 - Use ML to infer community assembly process
- Step 4 - Use ML to estimate key community assembly parameters
- Step 5 - ???
- Step 6 - Profit!!
Installation¶
MESS is distributed as a conda package so installation is simple and straightforward. If you don’t already have conda and/or MESS installed, please take a moment to install the software.
Getting started with the MESS CLI¶
From here on out we assume you are sitting at a terminal in a conda environment with MESS installed.
To better understand how to use MESS, let’s take a look at the `-h` argument to seek ‘help’. We will use some of the MESS command line arguments in this tutorial (for example: `-n`, `-p`, `-s`, `-c`). The complete list of optional arguments and their explanation can be accessed with the `-h` flag:
$ MESS -h
usage: MESS [-h] [-n new] [-p params] [-s sims] [-c cores] [-r] [-e empirical]
[-f] [-q] [-Q] [-d] [-l] [--ipcluster [ipcluster]] [--fancy-plots]
optional arguments:
-h, --help show this help message and exit
-n new create new file 'params-{new}.txt' in current
directory
-p params path to params file simulations: params-{name}.txt
-s sims Generate specified number of simulations
-c cores number of CPU cores to use (Default=0=All)
-r show status of this simulation run
-e empirical Validate and import empirical data.
-f force overwrite of existing data
-q do not print to stderror or stdout.
-Q do not print anything ever.
-d print lots more info to mess_log.txt.
-l Write out lots of information in one directory per
simulation.
--ipcluster [ipcluster]
connect to ipcluster profile
--fancy-plots Construct fancy plots and animated gifs.
* Example command-line usage:
MESS -n data ## create new file called params-data.txt
MESS -p params-data.txt ## run MESS with settings in params file
MESS -p params-data.txt -f ## run MESS, overwrite existing data.
Create and edit a new parameters file¶
MESS uses a text file to hold all the parameters for a given community assembly scenario. Start by creating a new parameters file with the `-n` flag. This flag requires you to pass in a name for your simulations. In the example we use `simdata`, but the name can be anything at all. Once you start analysing your own data you might call your parameters file something more informative, like the name of your target community and some details on the settings.
$ cd ~
$ mkdir MESS
$ cd MESS
# Create a new params file named 'simdata'
$ MESS -n simdata
This will create a file in the current directory called `params-simdata.txt`. The params file lists, on each line, one parameter value followed by a ## mark, then the name of the parameter and a short description of its purpose. Let’s take a look at it:
$ cat params-simdata.txt
------- MESS params file (v.0.1.1)---------------------------------------------
simdata ## [0] [simulation_name]: The name of this simulation scenario
./default_MESS ## [1] [project_dir]: Where to save files
0 ## [2] [generations]: Duration of simulations. Values/ranges Int for generations, or float [0-1] for lambda.
neutral ## [3] [community_assembly_model]: Model of Community Assembly: neutral, filtering, competition
point_mutation ## [4] [speciation_model]: Type of speciation process: none, point_mutation, protracted, random_fission
2.2e-08 ## [5] [mutation_rate]: Mutation rate scaled per base per generation
2000 ## [6] [alpha]: Abundance/Ne scaling factor
570 ## [7] [sequence_length]: Length in bases of the sequence to simulate
------- Metacommunity params: --------------------------------------------------
100 ## [0] [S_m]: Number of species in the regional pool
750000 ## [1] [J_m]: Total # of individuals in the regional pool
2 ## [2] [speciation_rate]: Speciation rate of metacommunity
0.7 ## [3] [death_proportion]: Proportion of speciation rate to be extinction rate
2 ## [4] [trait_rate_meta]: Trait evolution rate parameter for metacommunity
1 ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
------- LocalCommunity params: island1------------------------------------------
island1 ## [0] [name]: Local community name
1000 ## [1] [J]: Number of individuals in the local community
0.01 ## [2] [m]: Migration rate into local community
0 ## [3] [speciation_prob]: Probability of speciation per timestep in local community
Note: What’s the difference between a CLI argument and a MESS params file parameter, you may be asking yourself? Well, MESS CLI arguments specify how the simulations are performed (e.g. how many to run, how many cores to use, whether to print debugging information, etc.), whereas MESS params file parameters dictate the structure of the simulations to run (e.g. sizes of communities, migration rates, speciation rates, etc.).
The defaults are all values of moderate size that will generate ‘normal’ looking simulations, and we won’t mess with them for now, but let’s just change a couple of parameters to get the hang of it. Why don’t we change the `name` parameter of the local community: ‘island1’ is so generic! Pick your favorite island and change the name to this. Let’s also set `J` (size of the local community in individuals) equal to 500, as this will speed up the simulations (smaller local communities reach equilibrium faster).

We will use the `nano` text editor to modify `params-simdata.txt` and change this parameter:
$ nano params-simdata.txt
Nano is a command line editor, so you’ll need to use only the arrow keys on the keyboard for navigating around the file. Nano accepts a few special keyboard commands for doing things other than modifying text, and it lists these on the bottom of the frame. After you are done making the changes your file will now have lines that look like this:
La_Reunion ## [0] [name]: Local community name
500 ## [1] [J]: Number of individuals in the local community
Note: For scientific computing, in almost all cases, spaces in variable names and labels should be considered harmful. Notice here how we replace the space in ‘La Reunion’ with an underscore (`_`) character; this is common practice that you should adopt.
After you change these parameters you may save and exit nano by typing CTRL+o (to write Output), and then CTRL+x (to eXit the program).
Note: The `CTRL+x` notation indicates that you should hold down the control key (which is often styled ‘ctrl’ on the keyboard) and then push ‘x’.
Once we start running the simulations and performing MESS analyses, all the temp files and directories MESS needs are created in the `project_dir` directory and use the prefix specified by the `simulation_name` parameter. Because we use the default (`./default_MESS`) for the `project_dir` in this tutorial, all these intermediate directories will be of the form `~/MESS/default_MESS/simdata_*`, or the analogous name that you used for your assembly name.

Note on files in the project directory: MESS relies on the integrity of the `project_directory` for keeping track of various temporary files used by the simulation/analysis process. One result of this is that you can have multiple simulations of the same community assembly scenario using different parameter settings and you don’t have to manage all the files yourself! Another result is that you should not rename or move any of the files or directories inside your project directory, unless you know what you’re doing or you don’t mind if your simulations/analyses break.
Run simulations using your edited params file¶
Here we will start small and generate 10 simulations using 4 cores, just to get practice running the sims. After this command finishes we’ll have a batch of 10 new summary statistics in our `default_MESS` directory:

Special Note: In command line mode, always specify the number of cores with the `-c` flag. If you do not specify the number of cores, MESS assumes you want only one of them, which will result in painfully slow simulation runs (serial processing).
## -p the params file we wish to use
## -s the number of simulations to perform
## -c the number of cores to allocate <-- Important!
$ MESS -p params-simdata.txt -s 10 -c 4
-------------------------------------------------------------
MESS [v.0.1.1]
Massive Eco-Evolutionary Synthesis Simulations
-------------------------------------------------------------
Project directory exists. Additional simulations will be appended.
<MESS.Region simdata: ['La_Reunion']>
establishing parallel connection:
host compute node: [4 cores] on goatzilla
Generating 10 simulation(s).
[####################] 100% Performing Simulations | 0:00:46 |
[####################] 100%
Finished 10 simulations
Note: You can see here that MESS is intelligently handling all the parallelization work for you. You tell it how many cores to use with the `-c` flag and it portions out simulations among all the cores as they become available.
Inspect the output of the simulation runs¶
Simulation parameters and summary statistics are written to the `SIMOUT.txt` file in the `project_dir`, which is by default created in the current working directory as `./default_MESS`. You can check the length of this file:
$ wc -l default_MESS/SIMOUT.txt
11 default_MESS/SIMOUT.txt
## Use `less` to look inside the file. Use `q` to quit less when you are done.
$ less default_MESS/SIMOUT.txt
NB: Lines in this file are very long, so `less` will wrap the text by default. Turn off line wrapping by typing `-S` and then pushing `enter`, which makes the file easier to read.
S_m J_m speciation_rate death_proportion trait_rate_meta ecological_strength generations community_assembly_model
100 750000 2.0 0.7 2.0 1.0 0.0 neutral point_mutation 0.0 2000 570.0 500.0 0.01 0.0 189.0 0.696
100 750000 2.0 0.7 2.0 1.0 0.0 neutral point_mutation 0.0 2000 570.0 500.0 0.01 0.0 43.0 0.238
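Since each row of this file is one simulation (parameters plus summary statistics), it also loads cleanly into pandas for inspection. A minimal sketch; the tab separator and the small in-memory stand-in used here in place of `default_MESS/SIMOUT.txt` are assumptions for illustration:

```python
import io

import pandas as pd

# Stand-in for default_MESS/SIMOUT.txt; with the real file you would call
# pd.read_csv("default_MESS/SIMOUT.txt", sep="\t")
simout_text = (
    "S_m\tJ_m\tspeciation_rate\n"
    "100\t750000\t2.0\n"
    "100\t750000\t2.0\n"
)
sims = pd.read_csv(io.StringIO(simout_text), sep="\t")
print(sims.shape)  # one row per simulation, one column per parameter/statistic
```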
Setting prior ranges on parameters¶
Rather than explicitly specifying MESS parameters, let’s say you’re interested in actually estimating parameters from the observed data. We can do this by simulating over a range of values for each parameter of interest, and then using the MESS inference procedure to estimate these parameters. Let’s say you would like to estimate the size of the local community (`J`) and the migration rate into the local community (`m`). Edit your params file again with `nano`:
$ nano params-simdata.txt
and change the following two parameter settings. This time we specify a range of values:
1000-2000 ## [1] [J]: Number of individuals in the local community
0.001-0.01 ## [2] [m]: Migration rate into local community
Note: Saving and quitting from `nano`: `CTRL+o` then `CTRL+x`.
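A range like `1000-2000` plausibly means that each simulation draws a uniform value between the two endpoints. Here is a hypothetical sketch of that behavior; `sample_param` is illustrative only, not the actual MESS parser:

```python
import random

def sample_param(value):
    """Draw one simulation's value from a params-file entry.

    A plain number is used as-is; a 'lo-hi' range is sampled uniformly.
    """
    text = str(value)
    if "-" in text:
        lo, hi = (float(x) for x in text.split("-"))
        draw = random.uniform(lo, hi)
        # Integer-valued parameters like J stay integers
        return int(draw) if lo.is_integer() and hi.is_integer() else draw
    return float(text)

j = sample_param("1000-2000")   # e.g. 1479
m = sample_param("0.001-0.01")  # e.g. 0.0063
```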
Now run some more simulations (by default MESS will append these new simulations to the extant SIMOUT.txt file):
$ MESS -p params-simdata.txt -s 10 -c 4
-------------------------------------------------------------
MESS [v.0.1.1]
Massive Eco-Evolutionary Synthesis Simulations
-------------------------------------------------------------
Project directory exists. Additional simulations will be appended.
<MESS.Region simdata: ['La_Reunion']>
establishing parallel connection:
host compute node: [4 cores] on goatzilla
Generating 10 simulation(s).
[####################] 100% Performing Simulations | 0:00:46 |
[####################] 100%
Finished 10 simulations
Clean up ipcluster <ipyparallel.client.client.Client object at 0x7f15cc3c9090>
Let’s use `cut` to look at just the columns we’re interested in (`J` and `m`), which are the 13th and 14th columns.
$ cut -f 13,14 default_MESS/SIMOUT.txt
J m
500.0 0.01
500.0 0.01
500.0 0.01
500.0 0.01
500.0 0.01
500.0 0.01
500.0 0.01
500.0 0.01
500.0 0.01
500.0 0.01
1118.0 0.00205
1168.0 0.00172
1515.0 0.00323
1061.0 0.0014
1305.0 0.00859
1434.0 0.00881
1397.0 0.00706
1096.0 0.00509
1889.0 0.00112
1699.0 0.00285
And you’ll see that these parameter values are now drawn from the ranges we specified. Now you are ready to move on to MESS Machine Learning Inference, where you will see how we can analyse massive amounts of simulations under varying parameter ranges with a machine learning framework to estimate parameters of the model from real data.
Massive Eco-Evolutionary Synthesis Simulations (MESS) - API Tutorial¶
This is the second part of the MESS tutorial, in which we introduce the API mode using jupyter notebooks. This is meant as a broad introduction, but it does assume you’ve already completed the CLI tutorial and are familiar with some of the parameters and terminology. We will use as an example in this tutorial the spider community data set from La Reunion published by Emerson et al. (2017). However, you can follow along with one of the other example datasets if you like; the procedure will be identical although your results will vary.
- Running and connecting to a Jupyter notebook
- Create and parameterize a new MESS Region
- Run MESS serial simulations in API mode
- Run MESS parallel simulations in API mode
- Adding more simulations to a region
- Using simulations to perform statistical inference
- References
Running and connecting to a Jupyter notebook¶
For the purposes of this tutorial, all command interactions will take place inside a jupyter notebook running on your personal computer. For the most part we will be writing and executing python commands. Jupyter is already installed as a dependency of MESS, but if you need help getting a server running see the excellent Jupyter notebook documentation.
Create and parameterize a new MESS Region¶
MESS API mode lets you dive under the hood of the CLI mode a bit. You have all the power of the CLI mode, yet more flexibility. The first step in API mode is to create a MESS `Region`. A `Region` encompasses a very large metacommunity and a much smaller local community which is connected to it by colonization. In creating a `Region` the only thing you’re required to pass in is a name, so let’s go with “LaReunion” (no spaces!), as this is the region the empirical data is drawn from.
import MESS

reunion = MESS.Region("LaReunion")
print(reunion.get_params())
------- MESS params file (v.0.1.0)----------------------------------------------
LaReunion ## [0] [simulation_name]: The name of this simulation scenario
./default_MESS ## [1] [project_dir]: Where to save files
0 ## [2] [generations]: Duration of simulations. Values/ranges Int for generations, or float [0-1] for lambda.
neutral ## [3] [community_assembly_model]: Model of Community Assembly: neutral, filtering, competition
point_mutation ## [4] [speciation_model]: Type of speciation process: none, point_mutation, protracted, random_fission
2.2e-08 ## [5] [mutation_rate]: Mutation rate scaled per base per generation
2000 ## [6] [alpha]: Abundance/Ne scaling factor
570 ## [7] [sequence_length]: Length in bases of the sequence to simulate
------- Metacommunity params: --------------------------------------------------
100 ## [0] [S_m]: Number of species in the regional pool
750000 ## [1] [J_m]: Total # of individuals in the regional pool
2 ## [2] [speciation_rate]: Speciation rate of metacommunity
0.7 ## [3] [death_proportion]: Proportion of speciation rate to be extinction rate
2 ## [4] [trait_rate_meta]: Trait evolution rate parameter for metacommunity
1 ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
------- LocalCommunity params: Loc1---------------------------------------------
Loc1 ## [0] [name]: Local community name
1000 ## [1] [J]: Number of individuals in the local community
0.01 ## [2] [m]: Migration rate into local community
0 ## [3] [speciation_prob]: Probability of speciation per timestep in local community
These are all the parameters of the model. The defaults are chosen to reflect a typical oceanic island arthropod community. Don’t worry at this point about all the parameters; let’s focus for now on `community_assembly_model`, the size of the local community (`J`), and the rate of migration from the metacommunity to the local community (`m`). We will set parameter ranges for these, and each simulation will sample a random value from this range. In a new cell use the `set_param()` method to change these values:
reunion.set_param("community_assembly_model", "*")
reunion.set_param("J", "1000-10000")
reunion.set_param("m", "0.001-0.01")
NB: Setting the `community_assembly_model` to `*` indicates that we want to sample uniformly among all three of the model types: neutral, competition, and environmental filtering.
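A toy illustration of what `*` means for the sampling step (illustrative only, not the MESS internals):

```python
import random

models = ["neutral", "filtering", "competition"]

# With community_assembly_model set to "*", each simulation plausibly
# draws one of the three assembly models uniformly at random:
chosen_model = random.choice(models)
```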
Print the params again to prove to yourself that the ranges are now set:
print(reunion.get_params())
------- MESS params file (v.0.1.0)----------------------------------------------
LaReunion ## [0] [simulation_name]: The name of this simulation scenario
./default_MESS ## [1] [project_dir]: Where to save files
0 ## [2] [generations]: Duration of simulations. Values/ranges Int for generations, or float [0-1] for lambda.
* ## [3] [community_assembly_model]: Model of Community Assembly: neutral, filtering, competition
point_mutation ## [4] [speciation_model]: Type of speciation process: none, point_mutation, protracted, random_fission
2.2e-08 ## [5] [mutation_rate]: Mutation rate scaled per base per generation
2000 ## [6] [alpha]: Abundance/Ne scaling factor
570 ## [7] [sequence_length]: Length in bases of the sequence to simulate
------- Metacommunity params: --------------------------------------------------
100 ## [0] [S_m]: Number of species in the regional pool
750000 ## [1] [J_m]: Total # of individuals in the regional pool
2 ## [2] [speciation_rate]: Speciation rate of metacommunity
0.7 ## [3] [death_proportion]: Proportion of speciation rate to be extinction rate
2 ## [4] [trait_rate_meta]: Trait evolution rate parameter for metacommunity
1 ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
------- LocalCommunity params: Loc1---------------------------------------------
Loc1 ## [0] [name]: Local community name
1000-10000 ## [1] [J]: Number of individuals in the local community
0.001-0.01 ## [2] [m]: Migration rate into local community
0 ## [3] [speciation_prob]: Probability of speciation per timestep in local community
Run MESS serial simulations in API mode¶
Now we can run community assembly simulations given our new parameterization
using the run()
method. There is one required argument to this method
(nsims
) which indicates the number of independent community assembly
realizations to perform.
reunion.run(sims=1)
Generating 1 simulation(s).
[####################] 100% Finished 0 simulations | 0:01:02 |
Run MESS parallel simulations in API mode¶
Like the CLI, the MESS API can make use of all the cores you can throw at it, thanks to integration with the very nice ipyparallel library. Take a moment to launch an ipcluster instance.

Now we assume you have an `ipyclient` object initialized in your notebook. The `Region.run()` method also accepts an optional argument (`ipyclient`) for specifying a connection to an ipyparallel backend, allowing for massive parallelization. Let’s check how many cores our ipyclient is attached to:
len(ipyclient)
40
Now call run and generate 40 simulations on the 40 cores:
reunion.run(sims=40, ipyclient=ipyclient)
Generating 40 simulation(s).
[####################] 100% Performing Simulations | 0:01:31 |
[####################] 100%
Finished 40 simulations
Now we generated 40 simulations in parallel in roughly the same time it took to generate 1 simulation in serial. I say ‘roughly’ here for two reasons. First, the simulations are stochastic, and the amount of time any given simulation takes is Poisson distributed, so sometimes you’ll get ‘unlucky’ with one that takes much longer. Second, by default the `generations` parameter is `0`, which indicates to uniformly sample a `lambda` value from the half-open interval [0, 1). Small `lambda` will (on average) run faster than large `lambda`, so this is another source of variability in runtime.
Adding more simulations to a region¶
As with the CLI mode, if you find you need to add more simulations to a `Region`, for whatever reason, you can simply call `run()` again, and this will append the new simulations to what has already been run.
Using simulations to perform statistical inference¶
You can now proceed to the MESS Machine Learning Tutorial to learn how to use the simulations to perform model selection and parameter estimation on real data.
References¶
Emerson, B. C., Casquet, J., López, H., Cardoso, P., Borges, P. A.,
Mollaret, N., … & Thébaud, C. (2017). A combined field survey and
molecular identification protocol for comparing forest arthropod
biodiversity across spatial scales. Molecular ecology resources, 17(4),
694-707.
MESS Machine Learning (ML) Inference¶
Once you have performed a sufficient number of simulations you can now use these simulations to perform community assembly model selection and parameter estimation for observed community data. In this short tutorial we will demonstrate the fundamentals of how this works.
The `MESS.inference` architecture is based on the powerful and extensive scikit-learn python machine learning library. Another really amazing resource I highly recommend is the Python Data Science Handbook.
- Download and examine example empirical data
- Fetch pre-baked simulations
- Supported ensemble methods
- ML assembly model classification
- ML parameter estimation
- Perform posterior predictive simulations
- Experiment with other example datasets
- References
Download and examine example empirical data¶
We will be using an example dataset of community-scale COI sequences (~500bp) and densely sampled abundances for the spider community on the island of La Reunion, an overseas department of France and the largest of the Mascarene islands, located in the Indian Ocean approximately 1000 km from Madagascar. The data were collected and published by Emerson et al. (2017). For this exercise, we will use the pre-processed summary statistics which were generated from the real data and which are available in the MESS github repository. For further instruction on properly formatting and converting raw data into MESS-ready format see the MESS raw data handling page.
In a new cell in your notebook you can download the Reunion spider data like this:
!wget https://raw.githubusercontent.com/messDiv/MESS/master/empirical_data/Reunion_spiders/spider.dat
NB: The `!` prior to the command here indicates that the jupyter notebook should interpret this as a bash command executed at the command line. This is a handy way of executing bash commands from within the notebook environment, rather than returning to a terminal window. It’s just a nice shortcut.
Now make a new cell and import MESS and pandas (which is a python structured data library providing the DataFrame class), and read in the data you just downloaded.
%matplotlib inline
import MESS
import pandas as pd
spider_df = pd.read_csv("spider.dat", index_col=0)
spider_df[:5]
Special Note: The `%matplotlib inline` command tells jupyter to draw the figures inside your notebook environment. If you don’t include this your figures won’t show up!

NB: Importing pandas as `pd` is pretty canonical. It makes typing out pandas commands shorter because you can reference it as `pd` rather than `pandas`.

NB: The final line in the above command asks python to display the first 5 rows of the `spider_df` dataframe. It should look like this:

Fetch pre-baked simulations¶
Since it can take quite some time to run a number of simulations sufficient for model selection and parameter estimation, we will use a suite of pre-baked simulations generated ahead of time. This dataset is composed of ~8k simulations under each of the community assembly models, with a few of the parameters sampled from distributions and the rest set to fixed values. Fetch this dataset from the compphylo site with `wget`:
!wget https://compphylo.github.io/Oslo2019/MESS_files/MESS_simulations/SIMOUT.txt
!wc -l SIMOUT.txt
100%[======================================>] 14,234,338 50.0M/s in 0.3s
2019-08-11 01:25:27 (50.0 MB/s) - "SIMOUT.txt" saved [14234338/14234338]
24440 SIMOUT.txt
NB: The `wc` command counts the number of lines if you pass it the `-l` flag. You can see this series of ~25k simulations is about 14MB.
Supported ensemble methods¶
MESS currently supports three different scikit-learn ensemble methods for classification and parameter estimation:
- RandomForest (rf)
- GradientBoosting (gb)
- AdaBoost (ab)
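Based on the names, these keywords plausibly map onto the corresponding scikit-learn ensemble estimators. The mapping below is an assumption for illustration, not the actual MESS wiring (see the MESS API specification for that):

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Hypothetical lookup from MESS algorithm keyword to sklearn estimator class
ALGORITHMS = {
    "rf": RandomForestClassifier,
    "gb": GradientBoostingClassifier,
    "ab": AdaBoostClassifier,
}

clf = ALGORITHMS["rf"]()  # instantiate with sklearn's default settings
```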
ML assembly model classification¶
The first step is now to assess the model of community assembly that best fits the data. The three models are `neutral`, in which all individuals are ecologically equivalent; `competition`, in which species have traits, and differential survival probability is weighted by distance of traits from the trait mean in the local community (closer to the trait mean == higher probability of death); and `filtering`, in which there is an environmental optimum, and proximity of trait values to this optimum is positively correlated with survival probability.

Basically we want to know: are individuals in the local community ecologically equivalent, and if not, are they interacting more with each other or more with the local environment?
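The three survival rules above can be sketched as per-individual death weights. This is an illustrative toy model, not the MESS implementation; `death_weight` and its Gaussian kernels are assumptions chosen to match the verbal description:

```python
import math
import random

def death_weight(trait, local_mean, env_optimum, model, strength=1.0):
    """Relative probability that an individual with `trait` dies next."""
    if model == "neutral":
        # All individuals ecologically equivalent
        return 1.0
    if model == "competition":
        # Closer to the local trait mean -> higher probability of death
        return math.exp(-strength * (trait - local_mean) ** 2)
    if model == "filtering":
        # Closer to the environmental optimum -> higher survival,
        # i.e. lower death weight
        return 1.0 - math.exp(-strength * (trait - env_optimum) ** 2)
    raise ValueError(f"unknown model: {model}")

traits = [0.1, 0.5, 0.9]
weights = [death_weight(t, local_mean=0.5, env_optimum=0.0,
                        model="competition") for t in traits]
# Sample the next individual to die, weighted by the chosen model
victim = random.choices(range(len(traits)), weights=weights)[0]
```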
Here we create a `MESS.inference.Classifier` object, pass in the empirical dataframe and the simulations, and specify RandomForest (`rf`) as the algorithm to use (other options are GradientBoosting (`gb`) and AdaBoost (`ab`)). Full documentation of all the machine learning inference architecture is available in the MESS API specification.
cla = MESS.inference.Classifier(empirical_df=spider_df, simfile="SIMOUT.txt", algorithm="rf")
est, proba = cla.predict(select_features=False, param_search=False, quick=True, verbose=True)
The `Classifier.predict()` method provides a sophisticated suite of routines for automating machine learning performance tuning. Here we disable all of this in order for it to run quickly, trading accuracy for speed. The `select_features` argument controls whether or not to prune summary statistics based on correlation and informativeness. The `param_search` argument controls whether or not to exhaustively search the space of parameters to maximize accuracy of the ML algorithm selected. Finally, the `quick` argument subsets the size of the simulation database to 1000 simulations, again as a means of running quickly to verify, for example, whether all the input data and simulations are formatted properly. Under normal conditions, when you want to investigate real data, you should use `select_features=True, param_search=True, quick=False`. This will result in much more accurate inference, yet will take much more time to run.
The call to cla.predict() returns both the most likely community assembly model (est) and the model probabilities (proba), which you can view:
from IPython.display import display
display(est, proba)
NB: The first line of this codeblock imports the display method from the IPython library, which makes pd.DataFrames display nicely.

Here we see that the neutral model is vastly favored over the competition and filtering models.
Now we can examine the weight of importance of each feature, an indication of the contribution of each summary statistic to the inference of the ML algorithm. You can also get a quick and dirty plot of feature importances directly from the Classifier object with the plot_feature_importance() method.
display(cla.feature_importances())
cla.plot_feature_importance(figsize=(5,5))
NB: The figsize argument controls the size of the output figure. The default is (10, 10), so here we just make it a bit smaller to fit better in the notebook window.

The plot_feature_importance()
method prunes out all the summary
statistics that contribute less than 5% of information to the final
model. This method is simply for visualization. Here you can see that
the shape of the abundance distributions (as quantified by the first 4
Hill numbers) and the standard deviation of nucleotide diversity (pi)
within the local community are the most important summary statistics for
differentiating between the different community assembly models.
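Since the Hill numbers mentioned above do so much of the work here, it may help to see how they summarize an abundance distribution. The function below is a standard textbook definition (order 0 = species richness, order 1 = exp of Shannon entropy, order 2 = inverse Simpson index), not code taken from MESS:

```python
import math

def hill_number(abundances, q):
    # Hill number of order q: q=0 is species richness, q=1 is
    # exp(Shannon entropy), q=2 is the inverse Simpson index.
    total = sum(abundances)
    p = [a / total for a in abundances if a > 0]
    if q == 1:
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

even = [10, 10, 10, 10]
skewed = [37, 1, 1, 1]
```

For a perfectly even community all Hill numbers equal the species richness, while skewed abundances drag the higher-order Hill numbers down, which is why the first few orders jointly capture the shape of the abundance distribution.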
ML parameter estimation¶
Now that we have identified the neutral model as the most probable, we can estimate parameters of the empirical data given this model. Essentially, we are asking the question “What are the parameters of the model that generate summary statistics most similar to those of the empirical data?”
The MESS.inference module provides the Regressor class, which has a very similar API to the Classifier. We create a Regressor and pass in the empirical data, the simulations, and the machine learning algorithm to use, but this time we also add the target_model argument, which prunes the simulations to include only those of the community assembly model of interest. Again, we call predict() on the regressor and pass in all the arguments to make it run fast, at the cost of accuracy.
rgr = MESS.inference.Regressor(empirical_df=spider_df, simfile="SIMOUT.txt", target_model="neutral", algorithm="rfq")
est = rgr.predict(select_features=False, param_search=False, quick=True, verbose=False)
NB: Note that here we ask for the rfq algorithm, which designates random forest quantile regression and allows for constructing prediction intervals, something the stock rf algorithm doesn’t do. The gradient boosting method (gb) provides prediction intervals natively.
The return value est is a dataframe that contains predicted values for each model parameter and 95% prediction intervals, which are conceptually in the same ballpark as maximum likelihood confidence intervals (CI) and Bayesian highest posterior densities (HPD).
display(est)

Similar to the classifier, you can also extract feature importances from the regressor, to get an idea about how each feature is contributing to parameter estimation. In this case, the feature importance plots are a little more useful than the giant table. Also note that we’re requesting a slightly larger figure, because this time there will be much more information.
display(rgr.feature_importances())
rgr.plot_feature_importance(figsize=(20,8))


You can see that some parameters are strongly driven by one or a couple of features, while for others the feature importances are more diffuse. For example, species richness (S) contributes overwhelmingly to estimation of local community size (J), which makes sense. On the other hand, many more factors seem to contribute equally to estimation of migration rate (m).
Perform posterior predictive simulations¶
Finally, a very important step in machine learning inference is to validate the parameter estimates by performing posterior predictive checks (PPCs). The logic of this procedure is that we will generate a suite of simulations using our most probable parameter estimates, and compare these simulations to our empirical data. For our purposes here we will project the summary statistics of observed and simulated datasets into principal component space. If the parameter estimates are a good fit to the data then our posterior predictive simulations will cluster together with the real data, whereas if the parameter estimates are a poor fit, then the real data will be quite different from the simulations.
Spoiler alert: Posterior predictive simulations can take quite a while to run. Even with the parameter estimates from this toy data, a reasonable number of sims could take a couple of hours. So! For the purpose of this exercise we are going to transform the estimated parameters to make the simulations run faster. DO NOT DO THIS WITH YOUR REAL DATA!
est["J"] /= 2
est["m"] *= 10
est["_lambda"] /= 2
est
Here we reduced the size of the local community (J), increased the rate of migration (m), and decreased the duration of the simulation (_lambda). All these things will make the simulations run more quickly. Now go ahead and run this next cell; it should take one to two minutes.
MESS.inference.posterior_predictive_check(empirical_df=spider_df,
parameter_estimates=est,
est_only=True,
nsims=20,
verbose=True,
force=True)
Generating 20 simulation(s).
[####################] 100% Finished 19 simulations | 0:02:57 |
Calculating PCs and plotting
NB: The posterior_predictive_check() function can also accept an ipyclient parameter to specify using an ipyparallel backend. For simplicity here we will not use this option, but in reality it would be a good idea to parallelize the posterior predictive simulations. See the MESS parallelization docs for more info about how to implement the parallel backend.
This time the only things we have to pass in are the empirical data and the dataframe of prediction values, but we show a couple more arguments here for the sake of completeness. Setting est_only to True will use only the exact parameter estimates for all simulations, whereas setting it to False will sample uniformly between the upper and lower prediction intervals for each simulation. nsims is the number of posterior predictive simulations to perform. force will overwrite any pre-existing simulations, thus preventing mixing apples with oranges from different simulation runs. On the other hand, if this is False (the default), then this is a handy way to add more PPCs to your analysis.
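The est_only=False behavior amounts to one uniform draw per parameter per simulation. A minimal sketch, assuming hypothetical interval bounds (in practice these would come from the prediction-interval columns of the est dataframe returned by Regressor.predict()):

```python
import random

# Hypothetical prediction-interval bounds for two parameters
intervals = {"J": (1000.0, 2000.0), "m": (0.001, 0.01)}

def draw_ppc_params(intervals, rng):
    # est_only=False behavior: sample uniformly within each parameter's
    # prediction interval, once per posterior predictive simulation
    return {p: rng.uniform(lo, hi) for p, (lo, hi) in intervals.items()}

rng = random.Random(42)
params = draw_ppc_params(intervals, rng)
```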
The posterior_predictive_check()
method will also display the summary
statistics for the simulated and observed data projected into PC space. The
observed data is colored in red. Here is a pretty typical example of a good
fit of parameters to the data:

And here is a nice example of a poor fit of parameters to the data:

Quite a striking, and obvious difference.
Experiment with other example datasets¶
Now you may wish to experiment with some of the other empirical datasets provided on the MESS github repo, or explore the substantial MESS inference procedure API documentation, where we show a ton of other cool stuff MESS is capable of that we just don’t have time to go over in this short tutorial.
References¶
Emerson, B. C., Casquet, J., López, H., Cardoso, P., Borges, P. A.,
Mollaret, N., … & Thébaud, C. (2017). A combined field survey and
molecular identification protocol for comparing forest arthropod
biodiversity across spatial scales. Molecular ecology resources, 17(4),
694-707.
MESS Parallelization Docs¶
Either run this by hand on a compute node or add this line to your jupyter notebook job submission script before the call to launch the notebook server:
ipcluster start -n 40 --cluster-id=MESS --daemonize
Here -n 40
specifies the number of cores to use, so you’ll need to
change this to allocate the number you request in the job script.
Now inside the notebook you want to run simulations in, attach to the running ipyparallel instance like this:
import ipyparallel as ipp

try:
    ipyclient = ipp.Client(cluster_id="MESS")
    print(len(ipyclient))
except Exception as e:
    print("Could not attach to ipcluster 'MESS':", e)
And now you can use the parallel backend when running simulations inside a jupyter notebook:
import MESS
r = MESS.Region("Watdo")
r.run(nsims=1000, ipyclient=ipyclient)
In this miniature tutorial we will guide you through the steps of converting raw data into properly formatted MESS-friendly dataframes.
Put stuff here¶
MESS Model Parameters¶
The parameters contained in a params file affect the behavior of various parts of the forward-time and backward-time assembly process. The defaults that we chose are fairly reasonable values as a starting point; however, you will always need to modify at least a few of them (for example, to indicate the location of your data), and often you will want to modify many of the parameters.
Below is an explanation of each parameter setting, the eco-evolutionary process that it affects, and example entries for the parameter into a params.txt file.
NB: Most parameters for which it is sensible to do so will accept a prior range which will be sampled from uniformly for each simulation (e.g. a value of 1000-5000 for parameter J will choose a new J for each simulation inclusively bounded by these values).
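A sketch of how such an entry could be interpreted, for integer-valued parameters like J (the function name is hypothetical, and a real implementation would also need to handle float-valued ranges like 0.0001-0.001):

```python
import random

def sample_param(value, rng):
    # "1000-5000" -> uniform integer draw (inclusive bounds);
    # "1000" -> fixed value for every simulation
    if "-" in value:
        lo, hi = (int(v) for v in value.split("-"))
        return rng.randint(lo, hi)
    return int(value)

rng = random.Random(0)
J = sample_param("1000-5000", rng)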
simulation_name¶
The simulation name is used as the prefix for all output files. It should be a unique identifier for this particular set of simulations, meaning the set of parameters you are using for the current data set. When I run multiple related simulations I usually use names indicating the specific parameter combinations (e.g., filtering_nospeciation, J5000_neutral).
Example: New community simulations are created with the -n options to MESS:
## create a new assembly named J1000_neutral
$ MESS -n J1000_neutral
project_dir¶
A project directory can be used to group together multiple related simulations. A new directory will be created at the given path if it does not already exist. A good name for project_dir will generally be the name of the community/system being studied. The project dir path should generally not be changed after simulations/analysis are initiated, unless the entire directory is moved to a different location/machine.
Example entries into params.txt:
/home/watdo/MESS/galapagos ## [1] create/use project dir called galapagos
galapagos ## [1] create/use project dir called galapagos in cwd (./)
generations¶
This parameter specifies the amount of time to run forward-time simulations. It can be specified in a number of different ways, but overall time can be considered either in terms of Wright-Fisher (WF) generations or in terms of Lambda. For WF generations you should specify an integer value (or a range of integer values) which will run the forward-time process for WF * J / 2 time-steps (where a time-step is one birth/death/colonization/speciation event). For Lambda you may select either an exact Lambda value (a real value between 0 and 1 exclusive), or you can set generations equal to 0, which will draw random Lambda values between 0 and 1 for each simulation.
Example entries into params.txt:
0 ## [2] [generations]: Sample random Lambda values for each simulation
100 ## [2] [generations]: Run each simulation for 100 WF generations
50-100 ## [2] [generations]: Sample uniform between 50-100 WF generations for each simulation
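The WF-generations-to-time-steps conversion described above (WF * J / 2) is simple enough to spell out directly (the function name is ours, not part of the MESS API):

```python
def wf_timesteps(wf_generations, J):
    # Each WF generation corresponds to J / 2 birth/death/colonization/
    # speciation time-steps, so total steps = WF * J / 2
    return wf_generations * J // 2
```

So 100 WF generations in a local community of J = 1000 individuals runs the forward-time process for 50,000 time-steps.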
community_assembly_model¶
With this parameter you may specify a neutral or non-neutral scenario for the forward time process. There are currently three different options for this parameter: neutral, filtering, or competition. The neutral case indicates full ecological equivalence of all species, so all individuals have an equal probability of death at each time-step. In the filtering and competition models survival probability is contingent on proximity of species trait values to the environmental optimum, or distance from the local trait mean, respectively. You may also use the wildcard * here and MESS will randomly sample one community assembly model for each simulation.
Example entries into params.txt:
neutral ## [3] [community_assembly_model]: Select the neutral process forward-time
filtering ## [3] [community_assembly_model]: Select the environmental filtering process
* ## [3] [community_assembly_model]: Randomly choose one of the community assembly models
speciation_model¶
Specify a speciation process in the local community. If none then no speciation happens locally. If point_mutation then one individual will instantaneously speciate at rate speciation_prob for each forward-time step. If random_fission then one lineage will randomly split into two lineages at rate speciation_prob, with the new lineage receiving n = U~(1, local species abundance) individuals and the parent lineage retaining the remaining individuals. protracted will specify a model of protracted speciation, but this is as yet unimplemented.
Example entries into params.txt:
none ## [4] [speciation_model]: No speciation in the local community
point_mutation ## [4] [speciation_model]: Point mutation speciation process
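The random_fission split can be sketched in a couple of lines. This is an illustrative toy of the rule described above, not MESS's implementation; note that when n equals the full abundance the parent lineage is left with zero individuals (local extinction of the parent):

```python
import random

def random_fission(abundance, rng):
    # New lineage receives n ~ U(1, abundance); parent keeps the rest
    n = rng.randint(1, abundance)
    return abundance - n, n

parent, child = random_fission(20, random.Random(3))
```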
mutation_rate¶
Specify the mutation rate for backward-time coalescent simulation of genetic variation. This rate is the per base, per generation probability of a mutation under an infinite sites model.
Example entries into params.txt:
2.2e-08 ## [5] [mutation_rate]: Mutation rate scaled per base per generation
alpha¶
Scaling factor for transforming number of demes to number of individuals. alpha can be specified as either a single integer value or as a range of values.
Example entries to params.txt file:
2000 ## [6] [alpha]: Abundance/Ne scaling factor
1000-10000 ## [6] [alpha]: Abundance/Ne scaling factor
sequence_length¶
Length of the sequence to simulate in the backward-time process under an infinite sites model. This value should be specified based on the length of the region sequenced for the observed community data in bp.
Example entries to params.txt file:
570 ## [7] [sequence_length]: Length in bases of the sequence to simulate
S_m¶
S_m specifies the total number of species to simulate in the metacommunity. Larger values will result in more singletons in the local community and reduced rates of multiple-colonization.
Example entries to params.txt file:
500 ## [0] [S_m]: Number of species in the regional pool
100-1000 ## [0] [S_m]: Number of species in the regional pool
J_m¶
The total number of individuals in the metacommunity. This value is divided up
among all S_m
species following a logseries distribution. J_m
values
which are smaller will produce more even abundances in the metacommunity, which
will result in more equal probability of colonization among species.
Example entries to params.txt:
50000 ## [1] [J_m]: Total # of individuals in the regional pool
50000-500000 ## [1] [J_m]: Total # of individuals in the regional pool
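To build intuition for how J_m is divided among S_m species, here is a rough numpy sketch that draws a logseries-shaped abundance distribution and scales it to roughly J_m total individuals. The shape parameter 0.99 is arbitrary for illustration, not a MESS default, and this is not the actual metacommunity-generation code:

```python
import numpy as np

rng = np.random.default_rng(1)
S_m, J_m = 500, 50000

# Logseries draws are heavy-tailed: a few abundant species, many rare ones
raw = rng.logseries(0.99, size=S_m)
# Scale relative abundances to (approximately) J_m total individuals
abund = np.floor(raw / raw.sum() * J_m).astype(int)
```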
speciation_rate¶
The speciation rate parameter of the birth-death process which generates the metacommunity phylogeny.
Example entries to params.txt:
2 ## [2] [speciation_rate]: Speciation rate of metacommunity
death_proportion¶
The rate of extinction proportional to the speciation rate for the birth-death metacommunity phylogeny.
Example entries to params.txt:
0.7 ## [3] [death_proportion]: Proportion of speciation rate to be extinction rate
trait_rate_meta¶
The variance of the Brownian motion process for evolving traits along the metacommunity phylogeny. Smaller values will result in more trait similarity within clades, and larger values will result in more overdispersed traits.
Example entries to params.txt:
2 ## [4] [trait_rate_meta]: Trait evolution rate parameter for metacommunity
ecological_strength¶
This parameter dictates the strength of interactions in the environmental filtering and competition models. As the value of this parameter approaches zero, ecological strength is reduced and the assembly process increasingly resembles neutrality (ecological equivalence). Larger values increasingly concentrate the probability of death on individuals with traits farther from the environmental optimum (in the filtering model).
In the following examples the environmental optimum is 3.850979, and the ecological strength is varied from 0.001 to 100. Column 0 is species ID, column 1 is trait value, column 2 is unscaled probability of death, and column 3 is proportional probability of death. Models with strength of 0.001 and 0.01 are essentially neutral. Strength of 0.1 confers a slight advantage to individuals very close to the local optimum (e.g. species ‘t97’).



Ecological strength of 1 (below, left panel) is noticeably non-neutral (e.g. ‘t97’ survival probability is 10x greater than average). A value of 10 for this parameter generates a _strong_ non-neutral process (below, center panel: ‘t97’ is 100x less likely to die than average, and the distribution of death probabilities is more varied). Ecological strength values >> 10 are _extreme_ and will probably result in degenerate behavior (e.g. strength of 100 (below, right panel) in which several of the species will be effectively immortal, with survival probability thousands of times better than average).



Example entries to params.txt:
1 ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
0.001-1 ## [5] [ecological_strength]: Strength of community assembly process on phenotypic change
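The limiting behavior described above can be illustrated with a toy weighting scheme (again, not MESS's exact internal formula: the exponential form here is an assumption chosen so that strength -> 0 recovers neutrality and large strengths become extreme):

```python
import math

def filtering_death_probs(traits, optimum, strength):
    # Death weight grows exponentially with squared trait distance from
    # the environmental optimum, scaled by the ecological strength
    w = [math.exp(strength * (t - optimum) ** 2) for t in traits]
    total = sum(w)
    # Proportional probability of death for each individual
    return [wi / total for wi in w]

traits = [0.0, 1.0, 2.0]
near_neutral = filtering_death_probs(traits, 0.0, 0.001)
strong = filtering_death_probs(traits, 0.0, 10.0)
```

With strength 0.001 the death probabilities are nearly equal (neutral-like), while with strength 10 nearly all of the death probability piles onto the individual farthest from the optimum, mirroring the "effectively immortal" behavior described for large strengths.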
name¶
This is a label for the local community and is used to generate output file names. Generally not that useful unless you’re simulating multiple local communities.
Example entries to params.txt:
island1 ## [0] [name]: Local community name
J¶
The number of individuals in the local community. This value is fixed for the duration of a community assembly simulation. The size of J will have a strong impact on the consequent species richness and abundance distribution.
Example entries to params.txt:
1000-2000 ## [1] [J]: Number of individuals in the local community
m¶
The probability at any given timestep that a randomly sampled individual in the local community will be replaced by a migrant from the metacommunity. Larger values of m will generally result in increased species richness.
Example entries to params.txt:
0.01 ## [2] [m]: Migration rate into local community
speciation_prob¶
Given that a randomly sampled individual from the local community is being replaced by the offspring of another individual from the local community, this parameter defines the probability that this offspring individual will undergo a speciation process.
Example entries to params.txt:
0 ## [3] [speciation_prob]: Probability of speciation per timestep in local community
0.0001-0.001 ## [3] [speciation_prob]: Probability of speciation per timestep in local community
API Documentation¶
This is the API documentation for MESS
, and provides detailed information
on the Python programming interface. See the Massive Eco-Evolutionary Synthesis Simulations (MESS) - API Tutorial for an
introduction to using this API to run simulations.