Pipeline parameters¶

PhyloNext includes numerous configurable parameters.
For convenience, they are grouped into several categories:

Input/output options
Data subsetting:
- Taxonomic scope
- Spatial scope
Spatial outliers removal
Occurrence filtering and binning
Diversity estimation
Visualization
- Interactive (Leaflet-based)
- Static (maps in pdf format)
Phylogenetic tree-related parameters
Generic options
Nextflow-specific parameters

Input/output options¶

Define where the pipeline should find input data and save output data.

Parameter	Description	Type	Default
`--input`	Path to the directory with parquet files (Parquet format) ¹	`directory`
`--outdir`	The output directory where the results will be saved	`directory`	./results
`--phytree`	Custom phylogenetic tree in Newick format (optional) ²	`file`

¹: GBIF occurrence dump in the Parquet format. Could be stored locally or in the cloud (S3 or Azure Blob storage).

²: Users have the option to provide their own custom phylogenetic tree. If this option is chosen, the tips of the phylogenetic tree should be labeled either with Latin binomials (for example, "Homo_sapiens"), or with Open Tree IDs (for example, "ott359899"). Be sure to adjust the --phylabels parameter accordingly.
Additionally, a set of phylogenetic trees, which come with descriptions and are pre-packaged with the pipeline, are available at this link:
https://github.com/vmikk/PhyloNext/tree/main/test_data/phy_trees

Taxonomic scope¶

Define which taxa should be analyzed.

Parameter	Description	Type	Example	Default
`--phylum`	Phylum to analyze ¹	`string`	"Chordata"
`--classis`	Class to analyze ¹ ²	`string`	"Mammalia"
`--order`	Order to analyze ¹	`string`	"Carnivora"
`--family`	Family to analyze ¹	`string`	"Felidae,Canidae"
`--genus`	Genus to analyze ¹	`string`	"Felis,Canis,Lynx"
`--specieskeys`	Custom list of GBIF specieskeys (text file with a single column) ³	`file`	"SpeciesKeys.txt"
`--noextinct`	File with extinct species specieskeys for their removal (file with a single column, with header)	`file`	"Carnivora"
`--excludehuman`	Exclude genus "Homo" from occurrence data	`boolean`	True	True

¹: Multiple comma-separated values allowed (e.g., "Felidae,Canidae" for the Family rank).

²: Unfortunately, class is a reserved keyword in Nextflow. Therefore, Latin classis is used as a parameter name.

³: The --specieskeys argument enables users to specify a custom list of GBIF species keys.
This should be provided as a text file containing a single column of species keys, each corresponding to a specific species as defined in the GBIF taxonomic backbone (https://www.gbif.org/species/search). These species keys can belong to various taxa, including non-monophyletic clades, and are used to filter and retrieve specific species data from GBIF. Example of the file:

, where e.g. the first species key in the list (5219243) corresponds to Vulpes vulpes (https://www.gbif.org/species/5219243)

Spatial scope¶

Spatial filters.

Parameter	Description	Type	Example
`--latmin`	Minimum latitude of species occurrences ¹	`number`	5.1
`--latmax`	Maximum latitude of species occurrences ¹	`number`	15.5
`--lonmin`	Minimum longitude of species occurrences ¹	`number`	47.0
`--lonmax`	Maximum longitude of species occurrences ¹	`number`	55.5
`--country`	Country code, ISO 3166 format ²	`string`	"DE,PL,CZ"
`--polygon`	Custom area of interest (a file with polygons in GeoPackage format)	`file`	`Brazil_ThreeStates.gpkg`
`--wgsrpd`	Polygons of World Geographical Regions ³	`file`	`pipeline_data/WGSRPD.RData`
`--regions`	Names of World Geographical Regions ⁴	`string`	"L1_EUROPE,L1_ASIA_TEMPERATE"

¹: Coordinates should be provided in decimal degrees

²: Country codes should be provided in the two-letter ISO 3166-1 alpha-2 coding system,
see details here (column Alpha-2 code),
and this Wikipedia article.

³: PhyloNext ships WGSRPD shapefile as a built-in data. To use it, specify a full path to the data (e.g., $(realpath "${HOME}/.nextflow/assets/vmikk/PhyloNext/pipeline_data/WGSRPD.RData"))

⁴: Multiple comma-separated values allowed.

Currently, Level-1 and Level-2 WGSRPD polygons are supported.

Level-1 polygons (continental level): WGRSP-L1

Level-2 polygons (regional or subcontinental level): WGRSP-L2

Number	Polygon code
1	`L1_EUROPE`
2	`L1_AFRICA`
3	`L1_ASIA_TEMPERATE`
4	`L1_ASIA_TROPICAL`
5	`L1_AUSTRALASIA`
6	`L1_PACIFIC`
7	`L1_NORTHERN_AMERICA`
8	`L1_SOUTHERN_AMERICA`
9	`L1_ANTARCTIC`
------	---------------------------------
10	`L2_Northern_Europe`
11	`L2_Middle_Europe`
12	`L2_Southwestern_Europe`
13	`L2_Southeastern_Europe`
14	`L2_Eastern_Europe`
20	`L2_Northern_Africa`
21	`L2_Macaronesia`
22	`L2_West_Tropical_Africa`
23	`L2_West_Central_Tropical_Africa`
24	`L2_Northeast_Tropical_Africa`
25	`L2_East_Tropical_Africa`
26	`L2_South_Tropical_Africa`
27	`L2_Southern_Africa`
28	`L2_Middle_Atlantic_Ocean`
29	`L2_Western_Indian_Ocean`
30	`L2_Siberia`
31	`L2_Russian_Far_East`
32	`L2_Middle_Asia`
33	`L2_Caucasus`
34	`L2_Western_Asia`
35	`L2_Arabian_Peninsula`
36	`L2_China`
37	`L2_Mongolia`
38	`L2_Eastern_Asia`
40	`L2_Indian_Subcontinent`
41	`L2_Indo_China`
42	`L2_Malesia`
43	`L2_Papuasia`
50	`L2_Australia`
51	`L2_New_Zealand`
60	`L2_Southwestern_Pacific`
61	`L2_South_Central_Pacific`
62	`L2_Northwestern_Pacific`
63	`L2_North_Central_Pacific`
70	`L2_Subarctic_America`
71	`L2_Western_Canada`
72	`L2_Eastern_Canada`
73	`L2_Northwestern_USA`
74	`L2_North_Central_USA`
75	`L2_Northeastern_USA`
76	`L2_Southwestern_USA`
77	`L2_South_Central_USA`
78	`L2_Southeastern_USA`
79	`L2_Mexico`
80	`L2_Central_America`
81	`L2_Caribbean`
82	`L2_Northern_South_America`
83	`L2_Western_South_America`
84	`L2_Brazil`
85	`L2_Southern_South_America`
90	`L2_Subantarctic_Islands`
91	`L2_Antarctic_Continent`

Custom polygons

It is possible to use custom polygons instead of WGSRPD file. Polygons should be in the simple feature collection format (class sf of the sf package, geometry type: MULTIPOLYGON) and must contain LevelName column. Data should be saved as a serialized R object (with saveRDS).

Spatial outliers removal¶

Parameter	Description	Type	Example	Default
`--dbscan`	Remove spatial outliers with density-based clustering	`boolean`	True	False
`--dbscannoccurrences`	Minimum species occurrence to perform DBSCAN	`integer`	30	30
`--dbscanepsilon`	DBSCAN parameter epsilon, km	`integer`	1500	1500
`--dbscanminpts`	DBSCAN min number of points	`integer`	3	3

Currently, only density-based clustering (DBSCAN) algorithm is implemented for removal of spatial outliers.
The DBSCAN algorithm requires 2 parameters:
- eps (--dbscanepsilon), which specifies how close points should be to each other to be considered a part of a cluster
If the distance between two points is lower or equal to this value, these points are considered neighbors.
- minPoints (--dbscanminpts), the minimum number of points to form a dense region.

Parameter --dbscannoccurrences is used to skip DBSCAN filtering for species with low number of unique points (e.g., <30).

For more details, see the blog post "Outlier Detection Using DBSCAN" by John Waller.

Occurrence filtering and binning¶

Parameter	Description	Type	Example	Default
`--minyear`	Minimum year of record's occurrences	`integer`	2000	1945
`--maxyear`	Maximum year of record's occurrences	`integer`	2010
`--coordprecision`	Coordinate precision threshold, decimal degrees ¹	`number`	0.1	0.1
`--coorduncertainty`	Maximum allowed coordinate uncertainty, meters ¹	`number`	10000	10000
`--coorduncertaintyexclude`	Black list of coordinate uncertainty values ¹	`string`	"9999"	"301,3036,999,9999"
`--basisofrecordinclude`	Round spatial coordinates to N decimal places ²	`string`	"PRESERVED_SPECIMEN"
`--basisofrecordexclude`	Round spatial coordinates to N decimal places ²	`string`	"FOSSIL_SPECIMEN"	"FOSSIL_SPECIMEN,LIVING_SPECIMEN"
`--h3resolution`	Spatial resolution of the H3 geospatial indexing system	`integer`	4	4
`--roundcoords`	Round spatial coordinates to N decimal places ³	`integer`	2	2

For spatial binning of species occurrences, PhyloNext uses H3 geospatial indexing system developed by Uber. H3 represents a hierarchical geospatial index, where each hexagonal grid cell has a unique index (e.g., 8a1f05835a37fff for GBIF headquarter). H3 supports 16 resolutions (from 1 to 16), which can be selected in PhyloNext using --h3resolution parameter. By default, PhyloNext uses resolution 4, which corresponds to a hexagon with edge length of 22.6 km and cell area of 1170 km². More details on H3 resolutions could be found here.

¹: It's possible to remove occurrence records with a high level of coordinate uncertainty or low precision. It can be done using some threshold values. Also, there are several known default values for coordinate uncertainty in meters, which could be black-listed as well (e.g., these values may correspond to records linked to country centroids).
NB! Records with missing values will not be removed, as many publishers do not fill these fields in the database.
For details, see the blog post. "Common things to look out for when post-processing GBIF downloads" by John Waller.

²: For details, see description of a Darwin Core term Basis of record and a short explanation here.
Multiple comma-separated values allowed.

³: As DBSCAN filtering is very computationally intensive, it is possible to reduce dataset size (almost without loosing precision) by rounding record coordinates. By default, coordinates are rounded to 2 decimal places, which corresponds to an accuracy of ~1.11 km at the equator. To disable coordinate rounding, set --roundcoords to a negative values (e.g., -1).

Removal of common spatial errors¶

PhyloNext implements several filters analogous to the filters in CoordinateCleaner R package by Alexander Zizka:

Country and province centroids
Capital coordinates
Coordinates of biodiversity institutions
Urban areas
Seas

Parameter	Description	Type	Default
`--terrestrial`	Land polygon for removal of non-terrestrial occurrences	`file`	enabled
`--rmcountrycentroids`	Polygons with country and province centroids	`file`	disabled
`--rmcountrycapitals`	Polygons with country capitals	`file`	disabled
`--rminstitutions`	Polygons with biological institutions and museums	`file`	disabled
`--rmurban`	Polygons with urban areas	`file`	disabled

PhyloNext provides files for removal of common spatial errors, the data are built-in in the pipeline_data direcory and can be selected in the following way:

--terrestrial `$(realpath "${HOME}/.nextflow/assets/vmikk/PhyloNext/pipeline_data/Land_Buffered_025_dgr.RData")`
--rmcountrycentroids `$(realpath "${HOME}/.nextflow/assets/vmikk/PhyloNext/pipeline_data/CC_CountryCentroids_buf_1000m.RData")`
--rmcountrycapitals `$(realpath "${HOME}/.nextflow/assets/vmikk/PhyloNext/pipeline_data/CC_Capitals_buf_10000m.RData")`
--rminstitutions `$(realpath "${HOME}/.nextflow/assets/vmikk/PhyloNext/pipeline_data/CC_Institutions_buf_100m.RData")`
--rmurban `$(realpath "${HOME}/.nextflow/assets/vmikk/PhyloNext/pipeline_data/CC_Urban.RData")`

Alternatively, users can provide own files with custom polygons.

Diversity estimation¶

Diversity estimation is performed using Biodiverse program by Shawn Laffan. Therefore, parameter values should correspond to the Biodiverse values.

Parameter	Description	Type	Example	Default
`--indices`	Diversity and endemism indices to estimate ¹	`string`	"calc_richness,calc_pd,calc_pe"	"calc_richness,calc_simpson_shannon,calc_endemism_whole,calc_pd,calc_pe,calc_phylo_corrected_weighted_endemism,calc_phylo_rpd1,calc_phylo_rpd2,calc_phylo_rpe1,calc_phylo_rpe2"
`--iterations`	Number of randomisation iterations for standardized effect size estimation	`integer`	1000	1000
`--biodiversethreads`	Number of Biodiverse threads	`integer`	10	10
`--randname`	Randomisation scheme type ²	`string`	"rand_structured"	"rand_structured"
`--randconstrain`	Polygons to perform spatially constrained randomization (GeoPackage format) ²	`file`	`ZoogeographicRegions.gpkg`

¹: Comma-separated list of metrics. More than 350 indices are supported.
For details see Biodiverse manual.
The most common indices are listed here: Diversity indices.

²: In order to estimate standardized effect sizes (SES) of different diversity metrics and determine whether the obtained estimates are more extreme than what might be predicted given a null model, Biodiverse requires to perform randomization. By default, Biodiverse uses the rand_structured randomization algorithm, which shuffles species occurrences randomly across the entire area while preserving the species richness of each sample or grid cell. This method works well for most studies and is very effective. However, in cases where the analysis is conducted on a large scale and involves samples from different biomes or continents, the total pool of species may span multiple environments. As a result, randomizations may allocate a polar taxon to the tropics, or a desert taxon to a rainforest. One way to address this issue is by performing randomizations within a subset of data based on a spatial condition, such as biomes or zoogeographic regions. PhyloNext allows the user to specify such conditions as a GeoPackage file with multiple polygons. This feature enables users to keep any species within the biome in which they are found while still randomly relocating them. Hence, the randomization process can be tailored to suit the specific needs of the study and ensure that the results accurately reflect the distribution of species in the area of interest.

Randomization algorithms in Biodiverse

See a series of blog posts by Shawn Laffan:
1. Better control of randomisations
2. How the rand_structured algorithm works
3. Spatially partition your randomisations

To illustrate how spatially-constrained randomizations can be executed for Mammals on a global scale, one can utilize the zoogeographical regions established by Shen et al., 2022, DOI:10.4236/oje.2022.123014.
These regions are conveniently available in the pipeline_data/ZoogeographicRegions.gpkg file, which is included in PhyloNext.
The file comprises seven biogeographical kingdoms:
- West Palaearctic kingdom
- East Palaearctic kingdom
- Indo-Pacific kingdom
- Afrotropical kingdom
- Australian kingdom
- Nearctic kingdom
- Neotropical kingdom

ZooGeographical_Regions
(original data source - https://www.scirp.org/journal/paperinformation.aspx?paperid=116248)

To activate spatially-constrained randomizations using these biogeographic regions, add the following argument to your command:

--randconstrain `$(realpath "${HOME}/.nextflow/assets/vmikk/PhyloNext/pipeline_data/ZoogeographicRegions.gpkg")`

Visualization - interactive¶

Interactive visualization depends on Leaflet library.

Parameter	Description	Type	Example	Default
`--leaflet_var`	Diversity and endemism indices to estimate ¹	`string`	"PD,SES_PD"	"RICHNESS_ALL,PD,SES_PD,PD_P,ENDW_WE,SES_ENDW_WE,PE_WE,SES_PE_WE,PE_CWE,SES_PE_CWE,CANAPE,Redundancy"
`--leaflet_color`	Color scheme for continuous variables ²	`string`	"RdYlBu"	"RdYlBu"
`--leaflet_palette`	Color palette for continuous variables ³	`string`	"quantile"	"quantile"
`--leaflet_bins`	Number of color bins for continuous variables	`integer`	5	5
`--leaflet_sescolor`	Color scheme for standardized effect sizes, SES ⁴	`string`	"hotspots"	"threat"
`--leaflet_redundancy`	Redundancy threshold for hiding the grid cells with low number of records ⁵	`number`	85	0
`--leaflet_canapesuper`	Include the `super-endemism` class in CANAPE results ⁶	`boolean`	True	False

¹: For a list of indices available in Biodiverse, see here.
Note: to show the index on a map, please include appropriate name of the Biodiverse subroutine with --indices parameter.
To display Standardized-Effect-Size-based variables, add SES_ prefix to the index name (e.g., SES_PD).
For CANAPE (categorical analysis of neo- and paleoendemism; Mishler et al., 2014), add CANAPE to the variables list.
To display redundancy index (measure of sampling completeness; see Mishler et al., 2020), add Redundancy to the variables list.

²: The name of a a color scheme from the ColorBrewer designed by Cynthia Brewer. For more options see here.

³: Defines mapping of data values to colors . See Leaflet docs:

colorNumeric is a simple linear mapping from continuous numeric data to an interpolated palette
colorBin also maps continuous numeric data, but performs binning based on value (see the cut function)
colorQuantile similarly bins numeric data, but via the quantile function

⁴: The default color scheme (threat-type) for SES values follows Mishler et al., 2014, where high SES values are in blue, and low SES values are in red.
Alternatively, it is possible to use the hotspot-type palette, where areas with the red values indicate grid cells that contain significantly higher diversity or endemism than expected (hotspots), while the blue values indicate grid cells that contain significantly lower diversity/endemism than expected (coldspots).

⁵: Sampling redundancy is defined as [1 – (richness / number of specimens)] in a grid cell. By default, parameter --leaflet_redundancy is set to 0, which will display all grid cells.

⁶: In contrast with the original publication by Mishler et al. (2020), the current version of Biodivere (v.4.1) classifies endemism into 3 categories (super-endemism type is merged with mixed endemism, see https://biodiverse-analysis-software.blogspot.com/2022/10/biodiverse-now-calculates-canape-for-you.html). In PhyloNext, it is possible to use both coloring schemes (with and without super-endemism).

CANAPE

CANAPE (Categorical Analysis of Palaeo and Neo Endemism) is able to distinguish different types of centres of endemism, and can thus give insights into different evolutionary and ecological processes that may be responsible for these patterns.

The centres of paleo-endemism indicate places where there are over-representation of long branches that are rare across the landscape.
The centres of neo-endemism indicate an area where there is an over-representation of short branches that are rare on the landscape.
Mixture of both paleo-endemism and neo-endemism
Super-endemic sites

Visualization - static¶

Parameter	Description	Type	Example	Default
`--plotvar`	Variables to plot ¹	`string`	"PD"	"RICHNESS_ALL,PD,PD_P"
`--plottype`	Plot type	`string`		raw
`--plotformat`	Plot format (jpg,pdf,png)	`string`		pdf
`--plotwidth`	Plot width (default, 18 inches)	`number`		18
`--plotheight`	Plot height (default, 18 inches)	`number`		18
`--plotunits`	Plot size units (in,cm)	`string`		in
`--world`	World basemap ²	`file`		enabled

¹: Multiple comma-separated values allowed. For details see --leaflet_var parameter notes.

²: World base map is based on Natural Earth data and is located in the pipeline_data/WorldMap_NaturalEarth_Medium.RData file.

Parameter	Description	Type	Default
`--phylabels`	Type of tip labels on a phylogenetic tree ("OTT" or "Latin") ¹	`string`	Latin
`--taxgroup`	Specific taxonomy group in Open Tree of Life (default, "All_life") ²	`string`	All_life
`--maxage`	Manually assign root age for a tree obtained from Open Tree of Life ³	`number`
`--phyloonly`	Prune Open Tree tips for which there are no phylogenetic inputs ⁴	`boolean`	false

¹: If a custom phylogenetic tree was provided by user (with --phytree parameter), it's important to specify the type of tip labels. Currently, two labeling schemes are supported:

Latin binomials in Genus_species format (e.g., "Homo_sapiens").
NB! whitespaces in the name and sub-species ranks (variety, strain, form, etc.) are not allowed!
Open Tree of Life IDs (e.g., "ott770315")

²: In the case if tree tips are in OTT-format, it is possible to limit the query scope at Taxonomic Name Resolution Service of Open Tree of Life. List available taxonomic contexts in rotl v.3.0.12 package:

All life (default)
Animals
- Birds
- Tetrapods
- Mammals
- Amphibians
- Vertebrates
- Arthropods
- Molluscs
- Nematodes
- Platyhelminthes
- Annelids
- Cnidarians
- Arachnids
- Insects
Land plants
- Hornworts
- Mosses
- Liverworts
- Vascular plants
- Club mosses
- Ferns
- Seed plants
- Flowering plants
- Monocots
- Eudicots
- Rosids
- Asterids
- Asterales
- Asteraceae
- Aster
- Symphyotrichum
- Campanulaceae
- Lobelia
Fungi
- Basidiomycetes
- Ascomycetes
Bacteria
Archaea
SAR group
- Excavata
- Amoebozoa
- Centrohelida
- Haptophyta
- Apusozoa
- Diatoms
- Ciliates
- Forams

Taxonomic context

Whitespace is not allowed in taxonomic context! Please replace space with underscore (e.g., "Seed plants" should be speciefied as --taxgroup Seed_plants).

³: In some cases, when there are no age estimates available for the phylogenetic tree nodes, automatic tree retrieval could fail. Therefore, it could be helpful to override the maximum node age to obtain the tree. Note, however, that in this case, units of some of the diversity indices (e.g., PD) become arbitrary.

⁴: Open Tree utilizes taxonomic data to augment the structure and completeness of the synthetic tree when source phylogenies are absent or sparsely sampled (see Hinchliff et al., 2015). If required, branches lacking phylogenetic support could be removed from the tree using --phyloonly argument.

Open Tree of Life

More information about Open Tree of Life is available at https://opentreeoflife.github.io/.

Generic options¶

Parameter	Description	Type	Default
`--deriveddataset`	Export list of GBIF dataset keys for the filtered species occurrences ¹	`boolean`	True
`--help`	Display help text (pipeline)	`boolean`
`--helpMsg`	Display help text (pipeline)	`boolean`

¹: Could be used for citation and preparing a derived dataset with unique DOI.

Nextflow-specific parameters¶

Parameter types

PhyloNext-specific parameters are specified using double dash prefix (e.g., --something value). Parameters related to the Nextflow workflow manager starts with a single dash prefix (e.g., -work-dir wd)

Parameter	Description	Type	Example	Default
`-qs`	Queue size (max number of processes that can be executed in parallel)	`integer`	8	number of available CPUs
`-w`	Path to the working directory to store intermediate results	`string`	"$(pwd)/wd"	"work"
`-resume`	Execute the pipeline using the cached results. Useful to continue executions that was stopped by an error
`-profile`	Configuration profiles (set of configuration attributes) ¹	`string`	test,docker
`-params-file`	Parameter file in YAML or JSON format ¹	`file`	`Mammals.yaml`
`-c`	Configuration file ¹	`file`		`nextflow.config`

¹: For examples, see the Configuration and profiles and Parameter file sections in the Usage documentation.

Configuration profiles

For more information on Nextflow profiles, see https://www.nextflow.io/docs/latest/config.html#config-profiles.