Obtaining the input data for PhyloNext¶
GBIF-mediated data in Parquet format¶
GBIF provides periodic (monthly) snapshots of species occurrence records. These snapshots are available on Amazon S3, Google Cloud Storage, and Microsoft Azure cloud storage. Data are stored in the Apache Parquet format, which allows queries to run quickly and efficiently.
Data format
For more information on the data format, see the description of GBIF Public Datasets.
Obtaining a local snapshot of species occurrences from GBIF¶
If you would like to run the pipeline locally, you may download the full GBIF occurrence dump from the Amazon AWS cloud using the AWS CLI program (no AWS account required). E.g., to download the 2022-05-01 dump to the GBIF directory in your home folder, run:
aws s3 sync \
s3://gbif-open-data-eu-central-1/occurrence/2022-05-01/occurrence.parquet/ \
~/GBIF/ \
--no-sign-request
An alternative GBIF snapshot is also hosted by the Microsoft AI for Earth program. To download it using the AzCopy command-line utility, run:
azcopy copy \
"https://ai4edataeuwest.blob.core.windows.net/gbif/occurrence/2022-05-01/occurrence.parquet/*" \
"~/GBIF/" \
--recursive=true
A list of snapshots with download links is available at https://www.gbif.org/occurrence-snapshots.
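The snapshot dates available in the public S3 bucket can also be listed directly with the AWS CLI (a small sketch, using the same bucket and --no-sign-request option as above):

aws s3 ls s3://gbif-open-data-eu-central-1/occurrence/ --no-sign-request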
Limitations
Currently, arrow v.10.0.0 does not support filtering of array-type columns. Such filtering could be useful, e.g., for removing records with common geospatial issues from the data (see the issue column in the Parquet files).
For developers
See the related issues ARROW-16641 and ARROW-16702 in the Apache Arrow issue tracker.
If you would like to remove records with geospatial issues from the data, please see the following section.
Subsetting occurrence data via GBIF API¶
Working with a subset of occurrences can increase the processing speed of the pipeline. It is possible to obtain a subset of species occurrences from GBIF programmatically using the application programming interface (API) provided by GBIF. An API is a set of rules and protocols that allows different software programs to communicate with each other; APIs thus enable developers to access the data and functionality of a website or web-based service in a controlled, programmatic way.
First, you must specify a set of filters that should be applied to the data. For this purpose, create a simple text file in JSON format (e.g., get_parquet.json, the file name used in the curl command below; see the example below). Please replace creator and notification_address with your own GBIF username and e-mail address.
{
  "format": "SIMPLE_PARQUET",
  "creator": "USERNAME",
  "notification_address": [ "USEREMAIL" ],
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "equals",
        "key": "OCCURRENCE_STATUS",
        "value": "PRESENT"
      },
      {
        "type": "equals",
        "key": "HAS_GEOSPATIAL_ISSUE",
        "value": "false"
      },
      {
        "type": "not",
        "predicate": {
          "type": "in",
          "key": "ESTABLISHMENT_MEANS",
          "values": [
            "MANAGED",
            "INTRODUCED",
            "INVASIVE",
            "NATURALISED"
          ]
        }
      },
      {
        "type": "not",
        "predicate": {
          "type": "in",
          "key": "BASIS_OF_RECORD",
          "values": [
            "FOSSIL_SPECIMEN",
            "LIVING_SPECIMEN"
          ]
        }
      },
      {
        "type": "not",
        "predicate": {
          "type": "in",
          "key": "ISSUE",
          "values": [
            "TAXON_MATCH_HIGHERRANK"
          ]
        }
      }
    ]
  }
}
Note the usage of HAS_GEOSPATIAL_ISSUE, which is a shortcut for the following issues: ZERO_COORDINATE, COORDINATE_INVALID, COORDINATE_OUT_OF_RANGE, and COUNTRY_COORDINATE_MISMATCH.
Geospatial filters & issues
To read more about different geospatial issues, see https://docs.gbif.org/course-data-use/en/geospatial-filters-issues.html
If you would like to restrict the taxonomic or spatial scope of the download, you may add additional predicates (query expressions that filter the occurrence records), e.g.:
{
  "type": "equals",
  "key": "GENUS_KEY",
  "value": "2978223"
},
{
  "type": "equals",
  "key": "COUNTRY",
  "value": "AU"
}
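If you do not know the numeric taxon key for a group, it can be looked up via the GBIF species-match API. A minimal sketch (the genus name Acacia is used here only as an illustration):

curl -Ss "https://api.gbif.org/v1/species/match?name=Acacia" | jq

The response includes the matched taxon and its numeric keys (e.g., genusKey), which can then be used in the predicates above.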
GBIF predicates
An extensive list of supported predicates and query parameters can be found here:
https://www.gbif.org/developer/occurrence
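Before submitting the request, it may be helpful to check that the JSON file is syntactically valid, e.g., with jq (assuming the query was saved as get_parquet.json, the file name used in the command below):

jq . get_parquet.json

jq will pretty-print the file, or report a parse error if the JSON is malformed.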
To send the download request, please fill in your GBIF username and password and run:
USER="USERNAME"
PASSWORD="PASSWORD"
curl -Ssi \
--user "$USER":"$PASSWORD" \
-H "Content-Type: application/json" \
-X POST -d @get_parquet.json \
https://api.gbif.org/v1/occurrence/download/request
Please note the download ID returned by curl.
To check the status of the request and get the download link, use:
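For example, a quick sketch using curl and jq with the download ID returned in the previous step (here, the same example ID as in the download command below):

curl -Ss https://api.gbif.org/v1/occurrence/download/0003936-220831081235567 | jq

The response reports the current status (e.g., PREPARING, RUNNING, or SUCCEEDED) and, once the download has finished, the downloadLink of the resulting archive.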
To download the results (zip archive with parquet files), use:
mkdir -p ~/GBIF_dumps
cd ~/GBIF_dumps
aria2c \
https://api.gbif.org/v1/occurrence/download/request/0003936-220831081235567.zip \
-o gbif_dump.zip
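The downloaded archive can then be unpacked, e.g. (assuming the unzip utility is installed):

unzip gbif_dump.zip -d gbif_dump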
GBIF API beginners guide
A very nice introduction to the GBIF APIs can be found in the "GBIF API beginners guide" by John Waller.
Required software
To run the commands mentioned above, you may need to install the following software: curl, jq, and aria2.
If you are using the conda package manager, this can be done in a single command:
conda install -c conda-forge curl aria2 jq
Obtaining a list of species keys for extinct species¶
To get a list of extinct species from GBIF, users may run the fetch_gbif_extinct_tax.py script.
Download the script:
wget https://raw.githubusercontent.com/vmikk/biodiverse-scripts/main/bin/fetch_gbif_extinct_tax.py
chmod +x fetch_gbif_extinct_tax.py
To use the script, please specify the GBIF taxon ID of the group of interest (e.g., 359 for Mammalia).
## Reptilia [358]
# https://www.gbif.org/species/358
python fetch_gbif_extinct_tax.py 358 --outfile extinct_tax_ids_Reptilia.txt
## Mammalia [359]
# https://www.gbif.org/species/359
python fetch_gbif_extinct_tax.py 359 --outfile extinct_tax_ids_Mammalia.txt
## Amphibia [131]
# https://www.gbif.org/species/131
python fetch_gbif_extinct_tax.py 131 --outfile extinct_tax_ids_Amphibia.txt
## Birds - class Aves [212]
# https://www.gbif.org/species/212
python fetch_gbif_extinct_tax.py 212 --outfile extinct_tax_ids_Birds.txt
## Tracheophyta [7707728]
# https://www.gbif.org/species/7707728
python fetch_gbif_extinct_tax.py 7707728 --outfile extinct_tax_ids_Tracheophyta.txt