Skip to content

Obtaining the input data for PhyloNext

GBIF-mediated data in Parquet format

GBIF provides monthly-based periodic snapshots of species occurrence records. These snapshots are available on Amazon S3, Google GCS, and Microsoft Azure cloud storages. Data are stored in Apache Parquet format files. Parquet format allows running queries quickly and efficiently.

Data format

For more information on the data format, see the description of GBIF Public Datasets.

Obtaining a local snapshot of species occurrences from GBIF

If you would like to run the pipeline locally, you may download the full GBIF occurrence dump from the Amazon AWS cloud using the AWS CLI program (no AWS account required). E.g., to download 2022-05-01 dump to the GBIF directory in your home folder run:

aws s3 sync \
  s3://gbif-open-data-eu-central-1/occurrence/2022-05-01/occurrence.parquet/ \
  ~/GBIF/ \
  --no-sign-request

Alternative GBIF snapshot is also hosted by the Microsoft AI for Earth program. To download it using the AzCopy command-line utility run:

azcopy copy \
  "https://ai4edataeuwest.blob.core.windows.net/gbif/occurrence/2022-05-01/occurrence.parquet/*" \
  "~/GBIF/" \
  --recursive=true

A list of snapshots with download links is available at https://www.gbif.org/occurrence-snapshots.

Limitations

Currently, arrow v.10.0.0 does not support filtering of array-type columns.
E.g., it could be useful for removal of common geospatial issues in the data (see column issue in the parquet files).

For developers

See the related issues ARROW-16641 and ARROW-16702 at Apache Arrow issue tracker.

If you would like to remove records with geospatial issues from the data, please see the following section.

Subsetting occurrence data via GBIF API

Working with a subset of occurrences could increase the processing speed of the pipeline. It is possible to obtain a subset of species occurrence from GBIF programmatically. For this purpose, one may use an application programming interface (API) provided by GBIF. API is a set of rules and protocols that allow different software programs to communicate with each other. Therefore, APIs enable developers to access the data and functionality of a website or web-based service in a controlled, programmatic way.

First, you must specify a set of filters that should be applied to the data. For this purpose, create a simple text file in JSON format (see the example below). Please replace creator and notification_address with your own.

{
 "format": "SIMPLE_PARQUET",
 "creator": "USERNAME",
 "notification_address": [ "USEREMAIL" ],
 "predicate": {
  "type": "and",
  "predicates": [
   {
    "type": "equals",
    "key": "OCCURRENCE_STATUS",
    "value": "PRESENT"
   },
   {
    "type": "equals",
    "key": "HAS_GEOSPATIAL_ISSUE",
    "value": "false"
   },
   {
    "type": "not",
    "predicate": {
     "type": "in",
     "key": "ESTABLISHMENT_MEANS",
     "values": [
      "MANAGED",
      "INTRODUCED",
      "INVASIVE",
      "NATURALISED"
     ]
    }
   },
   {
    "type": "not",
    "predicate": {
     "type": "in",
     "key": "BASIS_OF_RECORD",
     "values": [
      "FOSSIL_SPECIMEN",
      "LIVING_SPECIMEN"
     ]
    }
   },
   {
    "type": "not",
    "predicate": {
     "type": "in",
     "key": "ISSUE",
     "values": [
      "TAXON_MATCH_HIGHERRANK"
     ]
    }
   }
  ]
 }
}

Note the usage of HAS_GEOSPATIAL_ISSUE, which is a shortcut for the following issues: ZERO_COORDINATE, COORDINATE_INVALID, COORDINATE_OUT_OF_RANGE, and COUNTRY_COORDINATE_MISMATCH.

Geospatial filters & issues

To read more about different Geospatial issues, see
https://docs.gbif.org/course-data-use/en/geospatial-filters-issues.html

If you would like to specify taxonomic or spatial scopes, you may add additional predicates (query expressions to retrieve occurrence record downloads), e.g.:

   {
    "type": "equals",
    "key": "GENUS_KEY",
    "value": "2978223"
   },
   {
    "type": "equals",
    "key": "COUNTRY",
    "value": "AU"
   }

GBIF predicates

An extensive list of supported predicates and query parameters could be found here:
https://www.gbif.org/developer/occurrence

To send download request, please fill in your user name and password and run:

USER="USERNAME"
PASSWORD="PASSWORD"

curl -Ssi \
  --user "$USER":"$PASSWORD" \
  -H "Content-Type: application/json" \
  -X POST -d @get_parquet.json \
  https://api.gbif.org/v1/occurrence/download/request

Please note the download ID returned by curl.
To check the status of the request and get the download link, use:

curl -Ss https://api.gbif.org/v1/occurrence/download/0003936-220831081235567 | jq .
(don't forget to replace the download ID!)

To download the results (zip archive with parquet files), use:

mkdir -p ~/GBIF_dumps
cd ~/GBIF_dumps
aria2c \
  https://api.gbif.org/v1/occurrence/download/request/0003936-220831081235567.zip \
  -o gbif_dump.zip

GBIF API beginners guide

A very nice introduction to the GBIF APIs can be found in the "GBIF API beginners guide" by John Waller.

Required software

To run the commands mentioned above, you may need to install the following software:
curl, jq, and aria2.
If you are using conda package manager, it could be done in a singe command:
conda install -c conda-forge curl aria2 jq

Obtaining a list of specieskeys for extinct species

To get a list of extinct species from GBIF, users may run the fetch_gbif_extinct_tax.py script.

Download the script:

wget https://raw.githubusercontent.com/vmikk/biodiverse-scripts/main/bin/fetch_gbif_extinct_tax.py
chmod +x fetch_gbif_extinct_tax.py

To use the script, please specify the GBIF taxon ID of the group of interest (e.g., 359 for Mammalia).

## Reptilia [358]
# https://www.gbif.org/species/358
python fetch_gbif_extinct_tax.py 358 --outfile extinct_tax_ids_Reptilia.txt

## Mammalia [359]
# https://www.gbif.org/species/359
python fetch_gbif_extinct_tax.py 359 --outfile extinct_tax_ids_Mammalia.txt

## Amphibia  [131]
# https://www.gbif.org/species/131
python fetch_gbif_extinct_tax.py 131 --outfile extinct_tax_ids_Amphibia.txt

## Birds - class Aves [212]
# https://www.gbif.org/species/212
python fetch_gbif_extinct_tax.py 212 --outfile extinct_tax_ids_Birds.txt

## Tracheophyta [7707728]
# https://www.gbif.org/species/7707728
python fetch_gbif_extinct_tax.py 7707728 --outfile extinct_tax_ids_Tracheophyta.txt