Introduction
Overall, my research aims to understand how environmental factors, like wild fires, influence biodiversity. One of the most used databases for measuring biodiversity is the Global Biodiversity Information Facility GBIF. This is post is a brief introduction of how to use GBIF, using the example of mapping a few conifer species in California.
Load the libraries
library(rgbif)
library(tidyverse)
library(data.table)
library(maps)
library(leaflet)
Part 1 -Mapping one species
Using the R package function occ_data() request data from the GBIF database with the species name, selecting the observations that have a coordinate, and limiting the query to 10000 observations.
pinus_balfouriana <- occ_data(scientificName = "Pinus balfouriana", hasCoordinate = TRUE, limit = 10000)
Inspect the data structure.
summary(pinus_balfouriana)
## Length Class Mode
## meta 4 -none- list
## data 134 tbl_df list
There are many different groups that contribute data to GBIF. Make sure you cite them accordingly so we can continue to have a great stream of species occurrence data. As a a few examples this species has currated research grade observations from iNaturalist and many from the Humboldt State University.
head(gbif_citation(pinus_balfouriana), 2)
## [[1]]
## <<rgbif citation>>
## Citation: iNaturalist contributors, iNaturalist (2022). iNaturalist
## Research-grade Observations. iNaturalist.org. Occurrence dataset
## https://doi.org/10.15468/ab3s5x accessed via GBIF.org on 2022-04-05..
## Accessed from R via rgbif (https://github.com/ropensci/rgbif) on
## 2022-04-05
## Rights: http://creativecommons.org/licenses/by-nc/4.0/legalcode
##
## [[2]]
## <<rgbif citation>>
## Citation: Humboldt State University (2022). HSC - Humboldt State University
## Vascular Plant Herbarium. Occurrence dataset
## https://doi.org/10.15468/qguk7r accessed via GBIF.org on 2022-04-05..
## Accessed from R via rgbif (https://github.com/ropensci/rgbif) on
## 2022-04-05
## Rights: http://creativecommons.org/licenses/by-nc/4.0/legalcode
Take a look at what types of data are collected.
head(names(pinus_balfouriana$data))
## [1] "key" "scientificName" "decimalLatitude" "decimalLongitude"
## [5] "issues" "datasetKey"
The default GBIF query returns 126 columns of data. We do not have time to go through all of them for a single post so I will subset some important ones for some exploratory plotting purposes. Here I subset the Latitude and Longitude coordinates that we will use for mapping, if the species does occur (important for modeling in future posts), how uncertain the observation is in meters, and the references for the observation.
pinus_balfouriana_coords <- pinus_balfouriana$data[ , c("decimalLongitude", "decimalLatitude",
"occurrenceStatus", "coordinateUncertaintyInMeters", "references")]
# check out how the data is structured
head(pinus_balfouriana_coords)
## # A tibble: 6 × 5
## decimalLongitude decimalLatitude occurrenceStatus coordinateUncert… references
## <dbl> <dbl> <chr> <dbl> <chr>
## 1 -118. 36.6 PRESENT 1742 https://w…
## 2 -118. 36.8 PRESENT 130 https://w…
## 3 -118. 36.5 PRESENT NA https://w…
## 4 -118. 36.5 PRESENT NA https://w…
## 5 -118. 36.4 PRESENT NA https://w…
## 6 -118. 36.4 PRESENT NA https://w…
summary(pinus_balfouriana_coords)
## decimalLongitude decimalLatitude occurrenceStatus
## Min. :-123.21 Min. :35.92 Length:417
## 1st Qu.:-122.53 1st Qu.:36.49 Class :character
## Median :-118.56 Median :36.77 Mode :character
## Mean :-119.26 Mean :38.14
## 3rd Qu.:-118.32 3rd Qu.:40.95
## Max. : 11.45 Max. :50.83
##
## coordinateUncertaintyInMeters references
## Min. : 2 Length:417
## 1st Qu.: 10 Class :character
## Median : 201 Mode :character
## Mean : 2479
## 3rd Qu.: 802
## Max. :28534
## NA's :152
# pinus_balfouriana_coords$decimalLongitude # remove this row of bad data
pinus_balfouriana_coords <- slice(pinus_balfouriana_coords, -(278))
# two map functions... be clear! There are a few map functions with these different libraries loaded.
#?map()
maps::map(database = "state", region = "california")
points(pinus_balfouriana_coords[ , c("decimalLongitude", "decimalLatitude")], pch = ".", col = "red", cex = 3)
Subset our search to only Northern California
mm_geometry <- paste('POLYGON((-124.4323 42.0021, -121.5045 42.0021, -121.5045 40.194, -124.4323 40.194, -124.4323 42.0021))')
pinus_balfouriana_NC <- occ_data(scientificName = "Pinus balfouriana", hasCoordinate = TRUE, limit = 10000,
geometry = mm_geometry )
head(pinus_balfouriana_NC)
## $meta
## $meta$offset
## [1] 0
##
## $meta$limit
## [1] 300
##
## $meta$endOfRecords
## [1] TRUE
##
## $meta$count
## [1] 121
##
##
## $data
## # A tibble: 121 × 122
## key scientificName decimalLatitude decimalLongitude issues datasetKey
## <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 3455305209 Pinus balfouri… 41.3 -123. cdrou… 50c9509d-…
## 2 3328047476 Pinus balfouri… 41.3 -122. cdrou… 50c9509d-…
## 3 3456635831 Pinus balfouri… 40.9 -123. cdrou… 50c9509d-…
## 4 3456682356 Pinus balfouri… 41.2 -123. cdrou… 50c9509d-…
## 5 2620109834 Pinus balfouri… 40.7 -123. cdrou… 50c9509d-…
## 6 2814362317 Pinus balfouri… 41.2 -123. cdrou… 50c9509d-…
## 7 2850958731 Pinus balfouri… 41.3 -122. cdrou… 50c9509d-…
## 8 2862286454 Pinus balfouri… 41.3 -123. cdrou… 50c9509d-…
## 9 3067629037 Pinus balfouri… 40.9 -123. cdrou… 50c9509d-…
## 10 3384272954 Pinus balfouri… 40.9 -123. cdrou… 50c9509d-…
## # … with 111 more rows, and 116 more variables: publishingOrgKey <chr>,
## # installationKey <chr>, publishingCountry <chr>, protocol <chr>,
## # lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
## # hostingOrganizationKey <chr>, basisOfRecord <chr>, occurrenceStatus <chr>,
## # taxonKey <int>, kingdomKey <int>, phylumKey <int>, classKey <int>,
## # orderKey <int>, familyKey <int>, genusKey <int>, speciesKey <int>,
## # acceptedTaxonKey <int>, acceptedScientificName <chr>, kingdom <chr>, …
pinus_balfouriana_NC_coords <- pinus_balfouriana_NC$data[ , c("decimalLongitude", "decimalLatitude",
"individualCount", "occurrenceStatus", "coordinateUncertaintyInMeters",
"institutionCode", "references")]
Plot the data in a few different ways to see if there is anything strange.
plot(pinus_balfouriana_NC_coords$decimalLongitude, pinus_balfouriana_NC_coords$decimalLatitude) # examine the data
pinus_balfouriana_NC_coords %>% leaflet() %>% addTiles() %>%
addMarkers(~decimalLongitude, ~decimalLatitude)
Part 2- Map Multiple species
First test the workflow with only a few species then do entire batch.
# The test is commented out. Uncomment to test first.
# mm_species <- c("Pinus balfouriana", "pinus albicaulis", "pinus monticola") # uncomment to run small test version
# Entire miracle mile species set
mm_species <- c("Pinus balfouriana", "pinus albicaulis", "pinus monticola", "pinus jeffreyi", "pinus ponderosa", "pinus contorta", "pinus lambertiana", "abies concolor", "abies magnifica", "abies lasiocarpa", "picea engelmannii", "picea breweriana", "tsuga mertensiana", "pseudotsuga menziesii", "taxus brevifolia", "calocedrus decurrens", "juniperus communis", "juniperus occidentalis")
mm_all <- occ_data(scientificName = mm_species, hasCoordinate = TRUE, limit = 10000,
geometry = mm_geometry)
summary(mm_all)
## Length Class Mode
## Pinus balfouriana 2 -none- list
## pinus albicaulis 2 -none- list
## pinus monticola 2 -none- list
## pinus jeffreyi 2 -none- list
## pinus ponderosa 2 -none- list
## pinus contorta 2 -none- list
## pinus lambertiana 2 -none- list
## abies concolor 2 -none- list
## abies magnifica 2 -none- list
## abies lasiocarpa 2 -none- list
## picea engelmannii 2 -none- list
## picea breweriana 2 -none- list
## tsuga mertensiana 2 -none- list
## pseudotsuga menziesii 2 -none- list
## taxus brevifolia 2 -none- list
## calocedrus decurrens 2 -none- list
## juniperus communis 2 -none- list
## juniperus occidentalis 2 -none- list
mm_species_coords_list <- vector("list", length(mm_species))
names(mm_species_coords_list) <- mm_species
for (x in mm_species) {
coords <- mm_all[[x]]$data[ , c("decimalLongitude", "decimalLatitude", "occurrenceStatus", "coordinateUncertaintyInMeters", "institutionCode", "references")]
mm_species_coords_list[[x]] <- data.frame(species = x, coords)
}
Using the rbindlist() function from the data.frame package to take all of the species observations from a list to a large data.frame. The columns are the species name, the latitude and longitude coordinates, whether or not there was an observation, if there is any uncertainty about how accurate the GPS coordinate was, what platform the observation was made on, and the specific reference for the observation. Make sure you cite the references so we can keep these rich data streams coming!
tree_df <- rbindlist(mm_species_coords_list, fill = T)
dim(tree_df)
## [1] 5637 7
head(tree_df)
## species decimalLongitude decimalLatitude occurrenceStatus
## 1: Pinus balfouriana -122.7521 41.33424 PRESENT
## 2: Pinus balfouriana -122.4852 41.31701 PRESENT
## 3: Pinus balfouriana -122.9044 40.94235 PRESENT
## 4: Pinus balfouriana -122.9107 41.19077 PRESENT
## 5: Pinus balfouriana -122.6497 40.65773 PRESENT
## 6: Pinus balfouriana -122.7857 41.21531 PRESENT
## coordinateUncertaintyInMeters institutionCode
## 1: 27816 iNaturalist
## 2: NA iNaturalist
## 3: 77 iNaturalist
## 4: 5 iNaturalist
## 5: 7880 iNaturalist
## 6: 5 iNaturalist
## references
## 1: https://www.inaturalist.org/observations/81808167
## 2: https://www.inaturalist.org/observations/87043552
## 3: https://www.inaturalist.org/observations/86123914
## 4: https://www.inaturalist.org/observations/85415690
## 5: https://www.inaturalist.org/observations/44672814
## 6: https://www.inaturalist.org/observations/51134921
Just take a quick look at the raw observations plotted by latitude and longitude.
Plot all of the species on the California map.
maps::map(database = "state", region = "california")
points(tree_df[ , c("decimalLongitude", "decimalLatitude")], pch = ".", col = "blue", cex = 3)
Plot all of the species using ggplot in case you want to visualize the species with something a bit fancier.
mm_species_plot1 <- ggplot(tree_df, aes(x=decimalLongitude, y = decimalLatitude, color = species)) +
geom_point() + labs(color = "Species", title = "MM Zone")
mm_species_plot1
There is a lot more you can do for GBIF, but these notes should help for the purpose of mapping species occurrence. Now that the data is somewhat organized we can start doing some proper data cleaning and exploratory data analysis in a future post.