API Documentation¶

This is the API documentation for MESS, and provides detailed information on the Python programming interface. See the Intro API Tutorial for an introduction to using this API to run simulations.

Simulation model¶

Region¶

class MESS.Region(name, quiet=False, log_files=False)¶

The MESS Region is the fundamental unit of a batch of simulation scenarios. A Region encompasses both a Metacommunity and one or more Local Communities, and orchestrates the community assembly process.

Parameters:

name (str) – The name of this MESS simulation. This is used for creating output files.
quiet (bool) – Whether to print progress of simulations or remain silent.
log_files (bool) – For each community assembly simulation, create a directory in the outdir, write the exact parameters for the simulation, and dump the megalog to a file. The megalog includes all information about the final state of the local community, prior to calculating summary statistics per species. Primarily for debugging purposes.

run(sims, ipyclient=None, force=False, quiet=False)¶

Run the requested number of community assembly simulations for this Region.

Parameters:

sims (int) – The number of MESS community assembly simulations to perform.
ipyclient (ipyparallel.Client) – If specified use this ipyparallel client to parallelize simulation runs. If not specified simulations will be run serially.
force (bool) – Whether to append to or overwrite results from previous simulations. Setting force to True will overwrite any previously generated simulations in the project_dir/SIMOUT.txt file.
quiet (bool) – Whether to display progress of these simulations or remain silent.

set_colonization_matrix(matrix)¶

Set the matrix that describes colonization rates between local communities.

set_param(param, value, quiet=True)¶

A convenience function for setting parameters in API mode, where setting them directly is otherwise cumbersome. With the set_param method you can set parameters on the Region, the Metacommunity, or the LocalCommunity. Simply pass the parameter name and the value, and this method identifies the appropriate target.

Parameters:

param (string) – The name of the parameter to set.
value – The value of the parameter to set.
quiet (bool) – Whether to print info to the console.

write_params(outfile=None, outdir=None, full=False, force=False)¶

Write out the parameters of this model to a file properly formatted as input for the MESS CLI. A good and simple way to share/archive parameter settings for simulations. This is also the function that’s used by __main__ to generate default params.txt files for MESS -n.

Parameters:

outfile (string) – The name of the params file to generate. If not specified this will default to params-<Region.name>.txt.
outdir (string) – The directory to write the params file to. If not specified this will default to the project_dir.
full (bool) – Whether to write out only the specific parameter values of this Region, or to also include the prior ranges for parameter values.
force (bool) – Whether to overwrite if a file already exists.

Metacommunity¶

class MESS.Metacommunity(meta_type='logser', quiet=False)¶

The metacommunity from which individuals are sampled for colonization to the local community.

Parameters:

meta_type (str) – Specify the distribution of abundance among species in the metacommunity. Can be one of: logser, lognorm, uniform, or the name of a file from which to read metacommunity abundances.
quiet (bool) – Whether to print info about metacommunity construction.

get_migrants(nmigrants=1)¶

Sample individuals from the Metacommunity. Each individual is independently sampled with replacement from the Metacommunity.

Returns:	A tuple of lists of species IDs (str) and trait values (float).
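The sampling scheme described above can be sketched without MESS itself. This is a minimal illustration using only the standard library, not the MESS implementation: the toy metacommunity, the helper name sample_migrants, and the assumption that a species is drawn with probability proportional to its abundance are all introduced here for the example.

```python
import random

# Toy metacommunity: species ID -> (abundance, trait value).
# All IDs and numbers are made up for illustration.
toy_metacommunity = {
    "sp1": (500, -0.8),
    "sp2": (300, 0.1),
    "sp3": (150, 0.4),
    "sp4": (50, 1.3),
}


def sample_migrants(metacommunity, nmigrants=1, rng=random):
    """Draw nmigrants individuals independently and with replacement.

    Assumption: each draw picks a species with probability proportional
    to its abundance in the metacommunity.
    """
    species = list(metacommunity)
    weights = [metacommunity[sp][0] for sp in species]
    migrant_ids = rng.choices(species, weights=weights, k=nmigrants)
    migrant_traits = [metacommunity[sp][1] for sp in migrant_ids]
    # Same shape as the documented return value: (species IDs, trait values).
    return migrant_ids, migrant_traits


ids, traits = sample_migrants(toy_metacommunity, nmigrants=5, rng=random.Random(42))
```

Because sampling is with replacement, the same species can appear several times in the returned lists, and common species are expected to appear more often than rare ones.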

Local Community¶

class MESS.LocalCommunity(name='Loc1', J=1000, m=0.01, quiet=False)¶

Construct a local community.

Parameters:

name (str) – The name of the LocalCommunity.
J (int) – The number of individuals in the LocalCommunity.
m (float) – Migration rate into the LocalCommunity. This is the probability per time step that a death is replaced by a migrant from the metacommunity.
quiet (bool) – Print out some info about the local community.

get_abundances(octaves=False, raw_abunds=False)¶

Get the species abundance distribution (SAD) of the local community.

Parameters:

octaves (bool) – Return the SAD binned into size-class octaves.
raw_abunds (bool) – Return the actual list of abundances per species, without binning into SAD.

Returns:

If raw_abunds is True, returns a list of abundances per species; otherwise returns an OrderedDict with keys as abundance classes and values as counts of species per class.
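To make the three output forms concrete, the sketch below bins a made-up list of raw abundances into an SAD and into octaves. It is a standard-library illustration, not the MESS code. The helper name abundance_distribution is hypothetical, and the octave convention used here (octave k holds abundances from 2^k up to but not including 2^(k+1)) is an assumption, since the binning rule is not stated in this documentation.

```python
from collections import Counter, OrderedDict


def abundance_distribution(raw_abunds, octaves=False):
    """Count species per abundance class, or per octave if requested."""
    if octaves:
        # Assumed octave convention: floor(log2(n)), computed exactly
        # for positive integers via bit_length().
        classes = Counter(n.bit_length() - 1 for n in raw_abunds)
    else:
        classes = Counter(raw_abunds)
    # Keys are sorted so classes appear from rare to common.
    return OrderedDict(sorted(classes.items()))


# Made-up abundances for seven species.
raw = [1, 1, 2, 3, 4, 8, 8]
sad = abundance_distribution(raw)
sad_octaves = abundance_distribution(raw, octaves=True)
```

Here sad maps each observed abundance to the number of species with that abundance, while sad_octaves pools abundances 2 and 3 into a single class.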

get_community_data()¶

Gather the community data and format it in such a way as to prepare it for calling MESS.stats.calculate_sumstats(). This is a way of getting simulated data that is in the exact format empirical data is required to be in. Useful for debugging and experimentation.

Returns:	A pandas.DataFrame with 4 columns: “pi”, “dxy”, “abundance”, and “trait”, and one row per species.

get_stats()¶

Simulate genetic variation per species in the local community, then aggregate abundance, pi, dxy, and trait data for all species and calculate summary statistics.

Returns:	A pandas.DataFrame including all MESS model parameters and all summary statistics.

step(nsteps=1)¶

Run one or more generations of birth/death/colonization timesteps. A generation is J/2 timesteps (the conversion from Moran to Wright-Fisher generations).

Parameters:	nsteps (int) – The number of generations to simulate.
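The generation-to-timestep conversion is simple arithmetic, worked through below for the default J=1000 and m=0.01. The helper name moran_timesteps is hypothetical, and the expected migrant count simply multiplies the number of timesteps by m, following the definition of m as a per-timestep probability.

```python
def moran_timesteps(nsteps, J):
    """Number of Moran timesteps in nsteps generations of J/2 timesteps each."""
    return int(nsteps * J / 2)


# Three generations in a local community of 1000 individuals.
total_timesteps = moran_timesteps(nsteps=3, J=1000)

# With m = 0.01, the expected number of deaths replaced by migrants
# over those timesteps is timesteps * m.
expected_migrants = total_timesteps * 0.01
```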

Inference Procedure¶

class MESS.inference.Ensemble(empirical_df, simfile='', sim_df='', target_model=None, algorithm='rf', metacommunity_traits=None, verbose=False)¶

The Ensemble class is a parent class from which Classifiers and Regressors inherit shared methods. You normally will not want to create an Ensemble class directly, but the methods documented here are inherited by both Classifier() and Regressor() so may be called on either of them.

Attention:	Ensemble objects should never be created directly. Ensemble is a base class that provides functionality to Classifier() and Regressor().

cross_val_predict(cv=5, features='', quick=False, verbose=False)¶

Perform K-fold cross-validation prediction. For each of the cv folds, simulations will be split into a training set containing (K - 1)/K of the simulations and a test set containing the remaining 1/K.

Note

CV predictions are not appropriate for evaluating model generalizability; they should only be used for visualization and exploration.

Parameters:

cv (int) – The number of K-fold cross-validation splits to perform.
quick (bool) – If True skip feature selection and hyper-parameter tuning, and subset simulations. Runs fast but does a bad job. For testing.
verbose (bool) – Report on progress. Depending on the number of CV folds this will be more or less chatty (mostly useless except for debugging).

Returns:

The array of predicted targets for each set of features when it was a member of the held-out testing set. Also saves the results in the Estimator.cv_preds variable.
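The way K-fold cross-validation partitions simulations can be sketched as follows. This is a plain-Python illustration of the general technique, not the routine MESS calls internally; the helper name kfold_indices is hypothetical and, for simplicity, the simulations are not shuffled before splitting.

```python
def kfold_indices(n_sims, cv=5):
    """Yield (train, test) index lists for each of the cv folds.

    Every simulation index lands in exactly one test fold, so each
    simulation receives exactly one held-out prediction.
    """
    indices = list(range(n_sims))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_sims // cv + (1 if i < n_sims % cv else 0) for i in range(cv)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(kfold_indices(n_sims=100, cv=5))
```

With 100 simulations and cv=5, each fold trains on 80 simulations and predicts the remaining 20.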

cross_val_score(cv=5, quick=False, verbose=False)¶

Perform K-fold cross-validation scoring. For each of the cv folds, simulations will be split into a training set containing (K - 1)/K of the simulations and a test set containing the remaining 1/K.

Parameters:

cv (int) – The number of K-fold cross-validation splits to perform.
quick (bool) – If True skip feature selection and hyper-parameter tuning, and subset simulations. Runs fast but does a bad job. For testing.
verbose (bool) – Report on progress. Depending on the number of CV folds this will be more or less chatty (mostly useless except for debugging).

Returns:

The array of scores of the estimator for each K-fold. Also saves the results in the Estimator.cv_scores variable.

dump(outfile)¶

Save the model to a file on disk. Useful for saving trained models to prevent having to retrain them.

Parameters:	outfile (str) – The file to save the model to.

feature_importances()¶

Assuming predict() has already been called, this method will return the feature importances of all features used for prediction.

Returns:	A pandas.DataFrame of feature importances.

feature_selection(quick=False, verbose=False)¶

Access to the feature selection routine. Uses BorutaPy, an all-relevant feature selection method. See https://github.com/scikit-learn-contrib/boruta_py and http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/

Hint:	Normally you will not run this on your own, but will use it indirectly through the predict() methods.
Parameters:

quick (bool) – Run fast but do a bad job.
verbose (bool) – Print lots of quasi-informative messages.

static load(infile)¶

Load a MESS.inference model from disk. This is complementary to the MESS.inference.Ensemble.dump() method.

Parameters:	infile (str) – The file to load a trained model from.
Returns:	Returns the MESS.inference.Ensemble object loaded from the input file.

plot_feature_importance(cutoff=0.05, figsize=(10, 12), layout=None, subplots=True, legend=False)¶

Construct a somewhat crude plot of feature importances, useful for a quick and dirty view of these values. If more than one feature is present in the model then a grid layout is constructed and each individual feature is displayed within a subplot. This function is a thin wrapper around pandas.DataFrame.plot.barh().

Parameters:

cutoff (float) – Remove any features that do not have greater importance than this value across all plotted features. Just remove uninteresting features to reduce the amount of visual noise in the figures.
figsize (tuple) – A tuple specifying figure width, height in inches.
layout (tuple) – A tuple specifying the row, column layout of the sub-panels. By default we do our best, and it’s normally okay.
subplots (bool) – Whether to plot each feature individually, or just cram them all into one huge plot. Unless you have only a few features, setting this option to False will look insane.
legend (bool) – Whether to plot the legend.

Returns:

Returns all the matplotlib axes

set_data(empirical_df, metacommunity_traits=None, verbose=False)¶

A convenience function to allow using pre-trained models to make predictions on new datasets without retraining the model. This will calculate summary statistics on input data (recycling metacommunity traits if these were previously input), and reshape the statistics to match the features selected during initial model construction.

This is only sensible if the data from the input community consists of the same axes as the data used to build the model. This will be useful if you have community data from multiple islands in the same archipelago, or different communities that share common features and a metacommunity.

Parameters:

empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here.
metacommunity_traits (array-like) – A list or np.array of the trait values from the metacommunity. Used for calculating some of the trait based summary statistics.
verbose (bool) – Print progress information.

set_features(feature_list='')¶

Specify the feature list to use for classification/regression. By default the methods use all features, but if you want to specify exact feature sets to use you may call this method.

Parameters:	feature_list (list) – The list of features (summary statistics) to retain for downstream analysis. Items in this list should correspond exactly to summary statistics in the simulations or else it will complain.

set_targets(target_list='')¶

Specify the target (parameter) list to use for classification/regression. By default the classifier will only consider community_assembly_model and the regressor will use all targets, but if you want to specify exact target sets to use you may call this method.

Parameters:	target_list (list) – The list of targets (model parameters) to retain for downstream analysis. Items in this list should correspond exactly to parameters in the simulations or else it will complain.

Model Selection (Classification)¶

class MESS.inference.Classifier(empirical_df, simfile='', sim_df='', algorithm='rf', metacommunity_traits=None, verbose=False)¶

This class wraps all the model selection machinery.

Parameters:

empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here.
simfile (string) – The path to the file containing all the simulations.
algorithm (string) – One of the Supported Ensemble Methods to use for classification.
metacommunity_traits (array-like) – A list or np.array of the trait values from the metacommunity. Used for calculating some of the trait based summary statistics.
verbose (bool) – Print detailed progress information.

cross_val_predict(cv=5, quick=False, verbose=False)¶

A thin wrapper around Ensemble.cross_val_predict() that calculates some Classifier-specific statistics after the cross-validation procedure. This function will calculate and populate class variables:

Classifier.classification_report: Mean absolute error

Parameters:

cv (int) – The number of cross-validation folds to perform.
quick (bool) – Whether to downsample to run fast but do a bad job.
verbose (bool) – Whether to print progress messages.

Returns:

A numpy.array of model class predictions for each simulation when it was a member of the held-out test set.

plot_confusion_matrix(ax='', figsize=(8, 8), cmap=<matplotlib.colors.LinearSegmentedColormap object>, cbar=False, title='', normalize=False, outfile='')¶

Plot the confusion matrix for CV predictions. Assumes Classifier.cross_val_predict() has been called. If not it complains and tells you to do that first.

Parameters:

ax (matplotlib.pyplot.axis) – The matplotlib axis to draw the plot on.
figsize (tuple) – If not passing in an axis, specify the size of the figure to plot.
cmap (matplotlib.pyplot.cm) – Specify the colormap to use.
cbar (bool) – Whether to add a colorbar to the figure.
title (str) – Add a title to the figure.
normalize (bool) – Whether to normalize the bin values (scale to 1/# simulations).
outfile (str) – Where to save the figure. This parameter should include the desired output file format, e.g. .png or .svg.

Returns:

The matplotlib.axis on which the confusion matrix was plotted.
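The quantity being plotted can be sketched independently of matplotlib. The example below tallies a confusion matrix from true and predicted model classes and optionally scales the counts by the total number of simulations, which is how the normalize description above is read here. The helper name, the label values, and the predictions are all made up for illustration and this is not the MESS plotting code.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, normalize=False):
    """Return (labels, matrix) with rows as true classes and columns as predictions."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    total = len(true_labels)
    matrix = []
    for true in labels:
        row = [counts[(true, pred)] for pred in labels]
        if normalize:
            # Assumed normalization: divide every cell by the number of simulations.
            row = [cell / total for cell in row]
        matrix.append(row)
    return labels, matrix


# Illustrative class labels and CV predictions for six simulations.
true = ["neutral", "neutral", "filtering", "filtering", "competition", "competition"]
pred = ["neutral", "filtering", "filtering", "filtering", "competition", "neutral"]

labels, cm = confusion_matrix(true, pred)
_, cm_norm = confusion_matrix(true, pred, normalize=True)
```

Cells on the diagonal count simulations whose model class was recovered correctly; off-diagonal cells show which classes are confused with one another.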

predict(select_features=True, param_search=True, by_target=False, quick=False, force=False, verbose=False)¶

Predict the community assembly model class probabilities.

Parameters:

select_features (bool) – Whether to perform relevant feature selection. This will remove features with little information useful for model prediction. Should improve classification performance, but does take time.
param_search (bool) – Whether to perform ML classifier hyperparameter tuning. If False then classification will be performed with default classifier options, which will almost certainly result in poor performance, but it will run really fast.
by_target (bool) – Whether to predict multiple target variables simultaneously, or each individually and sequentially.
quick (bool) – Reduce the number of retained simulations and the number of feature selection and hyperparameter tuning iterations to make the prediction step run really fast! Useful for testing.
force (bool) – Force re-running feature selection and hyper-parameter tuning. This is basically here to prevent you from shooting yourself in the foot inside a for loop with select_features=True when really what you want (most of the time) is to just run this once, and call predict() multiple times without redoing this.
verbose (bool) – Print detailed progress information.

Returns:

A tuple including the predicted model and the probabilities per model class.

Parameter Estimation (Regression)¶

class MESS.inference.Regressor(empirical_df, simfile='', sim_df='', target_model=None, algorithm='rfq', metacommunity_traits=None, verbose=False)¶

This class wraps all the parameter estimation machinery.

Parameters:

empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here.
simfile (string) – The path to the file containing all the simulations.
target_model (string) – The community assembly model to specifically use. If you include this then the simulations will be read and then filtered for only this community_assembly_model.
algorithm (string) – The ensemble method to use for parameter estimation.
metacommunity_traits (array-like) – A list or np.array of the trait values from the metacommunity. Used for calculating some of the trait based summary statistics.
verbose (bool) – Print lots of status messages. Good for debugging, or if you’re really curious about the process.

cross_val_predict(cv=5, quick=False, verbose=False)¶

A thin wrapper around Ensemble.cross_val_predict() that calculates some Regressor-specific statistics after the cross-validation procedure. This function will calculate and populate class variables:

Regressor.MAE: Mean absolute error
Regressor.RMSE: Root mean squared error
Regressor.vscore: Explained variance score
Regressor.r2: Coefficient of determination regression score

As well as Regressor.cv_stats which is just a pandas.DataFrame of the above stats.

Parameters:

cv (int) – The number of cross-validation folds to perform.
quick (bool) – Whether to downsample to run fast but do a bad job.
verbose (bool) – Whether to print progress messages.

Returns:

A numpy.array of parameter estimates for each simulation when it was a member of the held-out test set.
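For readers unfamiliar with the four statistics listed above, the sketch below computes them from their textbook definitions using only the standard library. It is not the code MESS runs, and the helper name regression_scores and the toy values are made up. The dictionary keys mirror the attribute names given above purely for readability.

```python
import math


def regression_scores(y_true, y_pred):
    """Compute MAE, RMSE, explained variance and R^2 from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_resid = sum(residuals) / n
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    var_resid = sum((r - mean_resid) ** 2 for r in residuals) / n
    var_true = ss_tot / n
    return {
        "MAE": sum(abs(r) for r in residuals) / n,
        "RMSE": math.sqrt(ss_res / n),
        # Explained variance ignores a constant bias in the predictions.
        "vscore": 1 - var_resid / var_true,
        # R^2 penalizes that bias.
        "r2": 1 - ss_res / ss_tot,
    }


# Toy case: every prediction overshoots the true value by 0.5.
scores = regression_scores([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5])
```

In this toy case the explained variance score is perfect while R^2 is not, because the predictions are consistently biased; comparing the two can therefore hint at systematic over- or under-estimation.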

plot_cv_predictions(ax='', figsize=(10, 5), figdims=(2, 3), n_cvs=1000, title='', targets='', outfile='')¶

Plot the cross validation predictions for this Regressor. Assumes Regressor.cross_val_predict() has been called. If not it complains and tells you to do that first.

Parameters:

ax (matplotlib.pyplot.axis) – The matplotlib axis to draw the plot on.
figsize (tuple) – If not passing in an axis, specify the size of the figure to plot.
figdims (tuple) – The number of rows and columns (specified in that order) of the output figure. There will be one plot per target parameter, so there should be at least as many available cells in the specified grid.
n_cvs (int) – The number of true/estimated points to plot on the figure.
title (str) – Add a title to the figure.
targets (list) – Specify which of the targets to include in the plot.
outfile (str) – Where to save the figure. This parameter should include the desired output file format, e.g. .png or .svg.

Returns:

The flattened list of matplotlib axes on which the scatter plots were drawn, one per target.

predict(select_features=True, param_search=True, by_target=False, quick=False, force=True, verbose=False)¶

Predict parameter estimates for selected targets.

Parameters:

select_features (bool) – Whether to perform relevant feature selection. This will remove features with little information useful for parameter estimation. Should improve parameter estimation performance, but does take time.
param_search (bool) – Whether to perform ML regressor hyperparameter tuning. If False then prediction will be performed with default options, which will almost certainly result in poor performance, but it will run really fast.
by_target (bool) – Whether to estimate all parameters simultaneously, or each individually and sequentially. Some ensemble methods are only capable of performing individual parameter estimation, in which case this parameter is forced to True.
quick (bool) – Reduce the number of retained simulations and the number of feature selection and hyperparameter tuning iterations to make the prediction step run really fast! Useful for testing.
force (bool) – Force re-running feature selection and hyper-parameter tuning. This is basically here to prevent you from shooting yourself in the foot inside a for loop with select_features=True when really what you want (most of the time) is to just run this once, and call predict() multiple times without redoing this.
verbose (bool) – Print detailed progress information.

Returns:

A pandas.DataFrame including the predicted value per target parameter, and 95% prediction intervals if the ensemble method specified for this Regressor supports it.

prediction_interval(interval=0.95, quick=False, verbose=False)¶

Add upper and lower prediction interval for algorithms that support quantile regression (rfq, gb).

Hint:	You normally won’t have to call this by hand, as it is incorporated automatically into the predict() methods. We allow access to it for experimental purposes.

Parameters:

interval (float) – The prediction interval to generate.
quick (bool) – Subsample the data to make it run fast, for testing. The quick parameter doesn’t do anything for rfq because it’s already really fast (the model doesn’t have to be refit).
verbose (bool) – Print information about progress.

Returns:

A pandas.DataFrame containing the model predictions and the prediction intervals.
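How an interval width maps onto a pair of quantiles can be shown with a small standalone sketch. This is only an illustration of the arithmetic, under the assumption that the interval is central (equal tail mass on each side); it does not reproduce the quantile regression performed by rfq or gb. Both helper names and the list of values are hypothetical.

```python
def interval_quantiles(interval=0.95):
    """Return the (lower, upper) quantiles bounding a central interval."""
    lower = (1 - interval) / 2
    return lower, 1 - lower


def empirical_quantile(values, q):
    """Quantile of a sample using linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    fraction = position - low
    return ordered[low] + fraction * (ordered[high] - ordered[low])


# Stand-in for a distribution of predicted values for one parameter.
predictions = list(range(101))

lower_q, upper_q = interval_quantiles(0.95)
lower_bound = empirical_quantile(predictions, lower_q)
upper_bound = empirical_quantile(predictions, upper_q)
```

For interval=0.95 the bounds sit at the 2.5% and 97.5% quantiles, so 95% of the stand-in values fall between lower_bound and upper_bound.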

Classification Cross-Validation¶

MESS.inference.classification_cv(simfile, data_axes='', algorithm='rf', quick=True, verbose=False)¶

A convenience function to make it easier and more straightforward to run classification CV. This wraps the work of generating the synthetic community (dummy data), selecting which input data axes to retain (which determines the summary statistics used by the ML), creating the Classifier, and calling Classifier.cross_val_predict() and Classifier.cross_val_score().

Feature selection is independent of the real data, so it doesn’t matter that we passed in synthetic empirical data here. It chooses features that are only relevant for each summary statistic. Searching for the best model hyperparameters is the same, it is done independently of the observed data.

Parameters:

simfile (str) – The file containing copious simulations.
data_axes (list) – A list of the data axis identifiers to prune the simulations with. One or more of ‘abundance’, ‘pi’, ‘dxy’, ‘trait’. If this parameter is left blank it will use all data axes.
algorithm (str) – One of the supported Ensemble.Regressor algorithm identifier strings: ‘ab’, ‘gb’, ‘rf’, ‘rfq’.
quick (bool) – Whether to run fast but do a bad job.
verbose (bool) – Whether to print progress information.

Returns:

Returns the trained MESS.inference.Classifier with the cross-validation predictions for each simulation in the cv_preds member variable and the cross-validation scores per K-fold in the cv_scores member variable.

Parameter Estimation Cross-Validation¶

MESS.inference.parameter_estimation_cv(simfile, target_model=None, data_axes='', algorithm='rf', quick=True, verbose=False)¶

A convenience function to make it easier and more straightforward to run parameter estimation CV. This wraps the work of generating the synthetic community (dummy data), selecting which input data axes to retain (which determines the summary statistics used by the ML), creating the Regressor, and calling Regressor.cross_val_predict() and Regressor.cross_val_score().

Feature selection is independent of the real data, so it doesn’t matter that we passed in synthetic empirical data here. It chooses features that are only relevant for each summary statistic. Searching for the best model hyperparameters is the same, it is done independently of the observed data.

Parameters:

simfile (str) – The file containing copious simulations.
target_model (str) – The target community assembly model to subsample the simulations with. If the parameter is blank it uses all simulations in the simfile.
data_axes (list) – A list of the data axis identifiers to prune the simulations with. One or more of ‘abundance’, ‘pi’, ‘dxy’, ‘trait’. If this parameter is left blank it will use all data axes.
algorithm (str) – One of the supported Ensemble.Regressor algorithm identifier strings: ‘ab’, ‘gb’, ‘rf’, ‘rfq’.
quick (bool) – Whether to run fast but do a bad job.
verbose (bool) – Whether to print progress information.

Returns:

Returns the trained MESS.inference.Regressor with the cross-validation predictions for each simulation in the cv_preds member variable and the cross-validation scores per K-fold in the cv_scores member variable.

Posterior Predictive Checks¶

MESS.inference.posterior_predictive_check(empirical_df, parameter_estimates, ax='', est_only=False, nsims=100, outfile='', use_lambda=True, force=False, verbose=False)¶

Perform posterior predictive simulations. This function will take parameter estimates and perform MESS simulations using these parameter values. It will then plot the resulting summary statistics in PC space, along with the summary statistics of the observed data. The logic of posterior predictive checks is that if the estimated parameters are a good fit to the data, then summary statistics generated using these parameters should resemble those of the real data.

Parameters:

empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here.
parameter_estimates (pandas.DataFrame) – A DataFrame containing the parameter estimates from a MESS.inference.Regressor.predict() call and optional prediction interval upper and lower bounds.
ax (matplotlib.pyplot.axis) – The matplotlib axis to use for plotting. If not specified then a new axis will be created.
est_only (bool) – If True, drop the lower and upper prediction interval (PI) and just use the mean estimated parameters for generating posterior predictive simulations. If False, and PIs exist, then parameter values will be sampled uniformly between the lower and upper PI.
nsims (int) – The number of posterior predictive simulations to perform.
outfile (str) – A file path for saving the figure. If not specified the figure is simply not saved to the filesystem.
use_lambda (bool) – Whether to generate simulations using time as measured in _lambda or in generations.
force (bool) – Force overwrite previously generated simulations. If not force then re-running will append new simulations to previous ones.
verbose (bool) – Print detailed progress information.

Returns:

A matplotlib.pyplot.axis containing the plot.
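The effect of est_only on how parameter values are chosen for each posterior predictive simulation can be sketched as below. This is a standalone illustration, not the MESS implementation: the helper name draw_parameters, the dictionary layout (estimate, lower PI, upper PI), and the numbers are all assumptions made for the example, and the real function works from a pandas.DataFrame.

```python
import random


def draw_parameters(estimates, est_only=False, rng=random):
    """Pick one value per parameter for a single posterior predictive simulation."""
    draws = {}
    for name, (estimate, lower, upper) in estimates.items():
        if est_only or lower is None or upper is None:
            # Use the point estimate when PIs are dropped or unavailable.
            draws[name] = estimate
        else:
            # Otherwise sample uniformly between the lower and upper PI.
            draws[name] = rng.uniform(lower, upper)
    return draws


# Hypothetical estimates: name -> (estimate, lower PI, upper PI).
toy_estimates = {"m": (0.01, 0.005, 0.02), "J": (1000, 800, 1500)}

rng = random.Random(1)
simulated_params = [draw_parameters(toy_estimates, rng=rng) for _ in range(100)]
point_params = draw_parameters(toy_estimates, est_only=True)
```

Sampling within the prediction interval propagates uncertainty in the estimates into the simulated summary statistics, whereas est_only=True repeats simulations at a single point in parameter space.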

Stats¶

MESS.stats.calculate_sumstats(diversity_df, sgd_bins=10, sgd_dims=1, metacommunity_traits=None, verbose=False)¶

Calculate all summary statistics on a dataset composed of one or more of the target data axes. This function will automatically detect the appropriate set of summary statistics to calculate based on the columns of the dataset. The passed in diversity_df may contain one or more of the following data axes:

abundance: Abundances per species as counts of individuals.
pi: Nucleotide diversity per base per species.
dxy: Absolute divergence between each species in the local community and the sister species in the metacommunity.
trait: The trait value of each species. Trait values are continuous and the distribution of trait values in the local community should be zero centered.

Note

This method should be used for calculating summary statistics for all empirical datasets as this is the method that is used to generate summary statistics for the simulations. This guarantees that observed and simulated statistics are calculated identically.

Parameters:

diversity_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is specified above.
sgd_bins (int) – The number of bins per axis of the constructed SGD. This must match the number of bins specified for simulations.
sgd_dims (int) – The number of dimensions of the constructed SGD. This value can be either 1 (pi only) or 2 (both pi and dxy). This parameter must match the number of dimensions specified for simulations.
metacommunity_traits (array-like) – A list or np.array of the trait values from the metacommunity. Used for calculating some of the trait based summary statistics. These values take the same form as the values of the local trait data (i.e. they should be continuous and zero centered).
verbose (bool) – Whether to print some informational messages.

Returns:

Returns a pandas.DataFrame with one row containing all of the applicable summary statistics for the input dataframe.

MESS.stats.feature_sets(empirical_df=None)¶

Convenience function for getting all the different combinations of summary statistics (features) that are relevant for the axes available in the empirical data. For example, if you have abundance and pi in your observed data then this function will return 3 feature sets relevant to: using only abundance as the data, using only pi as the data, and using both abundance and pi.

Parameters:

empirical_df (pandas.DataFrame) – A DataFrame containing the empirical data. This df has a very specific format which is documented here. If none is provided then summary statistics for all possible combinations of data axes are returned.

Returns:

A dictionary whose keys are string descriptors of the focal data axes and whose values are lists of summary statistics. Feature sets that encompass more than one data axis are keyed by a single string with the axis names joined by a ‘+’ sign (e.g. “abundance+pi+trait” is the key that gives the summary statistics relevant for all three of these data axes).
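The key naming scheme can be illustrated by enumerating every non-empty combination of data axes. The sketch below only generates the keys; it makes no attempt to list the summary statistics behind each key, and the helper name data_axis_keys is hypothetical rather than part of MESS.stats.

```python
from itertools import combinations


def data_axis_keys(axes):
    """Return one '+'-joined key per non-empty combination of data axes."""
    keys = []
    for size in range(1, len(axes) + 1):
        for combo in combinations(axes, size):
            keys.append("+".join(combo))
    return keys


# Two observed axes give three feature sets, matching the example above.
keys = data_axis_keys(["abundance", "pi"])

# All four axes give 2**4 - 1 = 15 combinations.
all_keys = data_axis_keys(["abundance", "pi", "dxy", "trait"])
```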

MESS.stats.Watterson(seqs, nsamples=0, per_base=True)¶

Calculate Watterson’s theta and optionally average over sequence length.

Parameters:

seqs (str/array-like) – The DNA sequence(s) over which to calculate the statistic. This parameter can be a single DNA sequence as a string, in which case we assume it is pooled data with IUPAC ambiguity codes indicating segregating sites. It can also be a string indicating the path to a fasta file, or a list of sequences which may or may not include the sample names (they will be removed).
nsamples (int) – The number of samples in the pooled sequence (for pooled data only).
per_base (bool) – Whether to average over the length of the sequence.

Returns:

The value of Watterson’s estimator of theta (float), averaged per base if per_base is True.
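For reference, Watterson's estimator divides the number of segregating sites S by the harmonic number a_n = 1 + 1/2 + ... + 1/(n - 1), where n is the number of sampled sequences. The sketch below applies that formula to a list of aligned, unambiguous sequences. It is a simplified stand-in with a hypothetical name, and unlike MESS.stats.Watterson it does not handle pooled data with IUPAC ambiguity codes, fasta files, or sample names.

```python
def watterson_theta(seqs, per_base=True):
    """Watterson's theta for equal-length aligned sequences without ambiguity codes."""
    n = len(seqs)
    length = len(seqs[0])
    # A site is segregating if more than one base is observed at it.
    segregating = sum(1 for site in zip(*seqs) if len(set(site)) > 1)
    # Harmonic number a_n = sum of 1/i for i = 1 .. n-1.
    harmonic = sum(1 / i for i in range(1, n))
    theta = segregating / harmonic
    return theta / length if per_base else theta


# Three toy sequences with two segregating sites out of four.
toy_seqs = ["AAAA", "AAAT", "AATT"]
theta_per_base = watterson_theta(toy_seqs)
theta_total = watterson_theta(toy_seqs, per_base=False)
```

With S = 2 and a_3 = 1.5, the total theta is 2/1.5 and the per-base value divides that by the sequence length of 4.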
