rOpenSci | What birds are observed near Radolfzell? Bird occurrence data in R

Thanks to the first post of the series we know where to observe birds near Radolfzell’s Max Planck Institute for Ornithology, so we could go and do that! Or we can stay behind our laptops and take advantage of eBird, a fantastic bird sightings aggregator! As explained by Matt Strimas-Mackey in his recent blog post, “The eBird database currently contains over 500 million records of bird sightings, spanning every country and over 98% of species, making it an extremely valuable resource for bird research and conservation.”

Luckily for us, there are no fewer than two rOpenSci packages giving us access to eBird data! In this blog post, I shall play with both of them, highlighting their respective strengths, while discovering what birds are observed in the area.

How to access eBird data?

There are two ways to access eBird data, with an R package for each: the eBird API, which rebird queries for recent observations, and the full eBird Basic Dataset (EBD), which auk processes.

Your use case will determine which entry point is the most appropriate. Note that both packages document their respective applications to help potential users: rebird README, auk README.

  • You want to study a region, or a bird, quite deeply and you even want absence/presence data, not only presence data. Use auk!

  • You want to build a tool based on recent observations only or you want to get a quick taste of eBird’s data. Use rebird!

  • A bit provocatively, do you want bird data only? If not, you may need a combination of auk/rebird and another package. Check out this list of data providers covered by spocc, the umbrella package for rOpenSci’s packages accessing occurrence data. Many data sources actually end up in GBIF datasets; eBird seems to upload its data there once a year.

  • You want to analyze your own eBird sightings? Check out the work-in-progress myebird by Sebastian Pardo, rebird’s maintainer, and this app by Simón Valdez-Juarez and Sebastian Pardo highlighting the most endangered species you observed.

  • You’re writing a birder’s guide to rOpenSci? Use both rebird and auk to show them off!

How to get access to eBird’s data

Whole eBird dataset, quarterly updated

One first needs to create an eBird account and then request access to the data. Once one has gotten the green light from eBird (in my case a few days after my request), and after a small dance of joy, it’s time to head to eBird’s download page. If one doesn’t want or need to download the whole eBird Basic Dataset (EBD), one can request a custom download. That is what I did, asking only for the data for Germany, which I got after a few days (the time it takes to receive the download link for a custom dataset is variable). While waiting, I worked on the rebird part of this post, among other things.

API key? Not yet

At the moment, rebird interfaces with version 1.1 of the eBird API, which will be retired “at some point in the future”. When this happens, the rebird package will use the new API, which means you’ll need an API key. Currently, though, you don’t need any authentication to use rebird.
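Once a key becomes necessary, one way to keep it out of scripts is to read it from an environment variable. The sketch below is base R only; the variable name EBIRD_KEY and the helper get_ebird_key() are assumptions made for illustration, not something rebird requires at the time of writing.

```r
# Hypothetical helper: read an API key from an environment variable.
# The variable name "EBIRD_KEY" is an assumption for this sketch.
get_ebird_key <- function(var = "EBIRD_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found: set ", var, " in your .Renviron first.",
         call. = FALSE)
  }
  key
}

# Usage for the current session only, with a dummy value
Sys.setenv(EBIRD_KEY = "not-a-real-key")
get_ebird_key()
```

Failing early with a clear message is a design choice: it avoids sending unauthenticated requests and debugging a confusing server error later.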

Using rebird while waiting for the eBird’s full dataset

In the following, we’ll use rOpenSci’s rebird package to get and map all observations made in the last 30 days near Radolfzell in Germany.

The Radolfzell part of that sentence is a bit different from the last post about finding bird hides near the MPI for Ornithology: I want all observations inside the polygon of the district of Constance (Landkreis Konstanz, including Radolfzell… and a protected natural area!), so I’ll first need to get that polygon. To do so I’ll use osmdata::getbb, which uses the free Nominatim API provided by OpenStreetMap.

library("sf")
library("magrittr") # provides the %>% pipe used below
landkreis_konstanz <- osmdata::getbb("Landkreis Konstanz",
                             format_out = "sf_polygon")

plot(landkreis_konstanz)

Limits of the County of Constance

Neither rebird nor spocc currently offer built-in trimming of occurrence data to a polygon (whereas osmdata does). A further difficulty created by eBird’s API is that it doesn’t allow for the use of a bounding box, but instead demands a lat, lng and a dist defining the radius of interest from given lat/lng in kilometers. Thanks to Marco Sciaini for providing me with an easy way to compute dist, using the sf package.

coord <- sf::st_coordinates(landkreis_konstanz)

bbox <- c(x1 = min(coord[, "X"]),
          x2 = max(coord[, "X"]),
          y1 = min(coord[, "Y"]),
          y2 = max(coord[, "Y"]))


center <- c(x = (bbox["x1"] + bbox["x2"])/2,
            y = (bbox["y1"] + bbox["y2"])/2)

dist <- landkreis_konstanz %>%
  sf::st_cast("POINT") %>%
  sf::st_distance() %>% 
  max() * 0.5

dist
## 24129.15 m
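As a rough cross-check that needs no spatial package, the radius can also be approximated as the great-circle distance from the centre of the bounding box to one of its corners. This is a sketch under assumptions: the haversine_km() helper is hypothetical, the Earth is treated as a sphere of radius 6371 km, and the corner coordinates are the rounded extent that auk prints later in this post, so the result will only be close to the sf value above, not equal to it.

```r
# Hypothetical helper: great-circle distance in km between two
# lon/lat points (in degrees), assuming a spherical Earth.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# Rounded bounding box of the county (assumed values)
bbox_approx <- c(x1 = 8.6, x2 = 9.2, y1 = 47.7, y2 = 47.9)

# Centre of the box, with plain names this time
center_approx <- c(x = mean(bbox_approx[c("x1", "x2")]),
                   y = mean(bbox_approx[c("y1", "y2")]))

# Radius: distance from the centre to the north-east corner
radius_km <- haversine_km(center_approx[["x"]], center_approx[["y"]],
                          bbox_approx[["x2"]], bbox_approx[["y2"]])
radius_km
```

A circle of roughly this radius around the centre should cover the whole box, which also means it reaches beyond the county itself.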

Now, we can make the query.

birds <- rebird::ebirdgeo(species = NULL,
                          lng = center["x.x1"],
                          lat = center["y.y1"],
                          back = 30,
                          dist = as.numeric(
                            units::set_units(dist, "km")))
nrow(birds)
## [1] 55
str(birds)
## Classes 'tbl_df', 'tbl' and 'data.frame':    55 obs. of  12 variables:
##  $ lng            : num  8.94 8.94 8.94 8.94 8.94 ...
##  $ locName        : chr  "Radolfzeller Aachmündung (Bodensee)" "Radolfzeller Aachmündung (Bodensee)" "Radolfzeller Aachmündung (Bodensee)" "Radolfzeller Aachmündung (Bodensee)" ...
##  $ sciName        : chr  "Chroicocephalus ridibundus" "Motacilla alba" "Rallus aquaticus" "Aythya fuligula" ...
##  $ obsValid       : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ locationPrivate: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ obsDt          : chr  "2018-08-08 13:30" "2018-08-08 13:30" "2018-08-08 13:30" "2018-08-08 13:30" ...
##  $ obsReviewed    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ comName        : chr  "Black-headed Gull" "White Wagtail" "Water Rail" "Tufted Duck" ...
##  $ lat            : num  47.7 47.7 47.7 47.7 47.7 ...
##  $ locID          : chr  "L3314048" "L3314048" "L3314048" "L3314048" ...
##  $ locId          : chr  "L3314048" "L3314048" "L3314048" "L3314048" ...
##  $ howMany        : int  NA 2 1 1 NA 3 NA NA 8 20 ...
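The dist argument is in kilometres while sf reports metres, hence the conversion in the call above. Here is a minimal sketch of that step in isolation, reusing the value printed earlier and assuming the units package is installed; mode = "standard" is passed so that the unit is read as a plain character string.

```r
# Distance in metres, as printed above
dist_m <- units::set_units(24129.15, "m", mode = "standard")

# Convert to kilometres, then drop the unit to get a bare number
dist_km <- as.numeric(units::set_units(dist_m, "km", mode = "standard"))
dist_km
```

Letting units do the conversion, rather than dividing by 1000 by hand, keeps the code correct if the input ever arrives in another length unit.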

Now that we have the occurrence data, let’s plot it to see whether trimming is required.

crs <- sf::st_crs(landkreis_konstanz)

birds_sf <- sf::st_as_sf(birds,
                         coords = c("lng", "lat"), 
                         crs = crs)
library("ggplot2")
ggplot() +
  geom_sf(data = landkreis_konstanz) +
  geom_sf(data = birds_sf) +
  theme(legend.position = "bottom") +
  hrbrthemes::theme_ipsum() +
  ggtitle("eBird observations over the last 30 days",
          subtitle = "Observations within a circle around the County of Constance")

Map of raw observations within a circle

Yes, trimming is required! It’d have been too bad not to learn how to do it, anyway. We also add the MPI to the map.

# which parts of the object are in the county
in_indices <- sf::st_within(birds_sf, landkreis_konstanz)

# filter them
trimmed_birds <- dplyr::filter(birds_sf,
                               lengths(in_indices) > 0)

# summarize to get no. of observations by location
# (dplyr::n() is namespaced because dplyr is not attached)
summarized_birds <- trimmed_birds %>%
  dplyr::group_by(locName) %>%
  dplyr::summarise(n = dplyr::n())

# MPI 
mpi <- opencage::opencage_forward("Am Obstberg 1 78315 Radolfzell", 
                                  limit = 1)$results

coords <- data.frame(lon = mpi$geometry.lng,
                     lat = mpi$geometry.lat)

crs <- sf::st_crs(landkreis_konstanz)

mpi_sf <- sf::st_as_sf(coords,
                       coords = c("lon", "lat"), 
                       crs = crs)

# Map!
ggplot() +
  geom_sf(data = landkreis_konstanz) +
  geom_sf(data = summarized_birds,
          aes(size = n), show.legend = "point") +
  hrbrthemes::theme_ipsum() +
  ggtitle("eBird observations over the last 30 days",
          subtitle = "County of Constance, MPI as a triangle") +
  geom_sf(data = mpi_sf,
          shape = 2) 

Trimmed observations in the county

We got 49 observations (nrow(trimmed_birds)) of 49 species (length(unique(trimmed_birds$comName))), over 2 places (length(unique(trimmed_birds$locName))) during 5 observation sessions. Hopefully this is merely an appetizer for what we can get from the full eBird dataset in the next section…
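The trimming step relies on sf::st_within(), which is easier to follow on a toy case. The sketch below uses a made-up square and three made-up points rather than the county and the eBird data; only the sf package is assumed.

```r
# A toy square polygon standing in for the county (made-up coordinates)
square <- sf::st_polygon(list(rbind(c(8.8, 47.6), c(9.0, 47.6),
                                    c(9.0, 47.8), c(8.8, 47.8),
                                    c(8.8, 47.6))))
area <- sf::st_sf(geometry = sf::st_sfc(square, crs = 4326))

# Three made-up observations: two inside the square, one outside
obs <- data.frame(id = c("a", "b", "c"),
                  lng = c(8.9, 8.95, 9.3),
                  lat = c(47.7, 47.65, 47.7))
obs_sf <- sf::st_as_sf(obs, coords = c("lng", "lat"), crs = 4326)

# st_within() returns, for each point, the indices of the polygons
# containing it; an empty element means the point is outside
inside <- lengths(sf::st_within(obs_sf, area)) > 0

obs_sf[inside, ]
```

Only the points "a" and "b" are kept, which is the same logic as the lengths(in_indices) > 0 filter used on the real data.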

Note that the initial query could have been made with spocc, which makes it easier to use the rest of the rOpenSci occurrence suite.

birds2 <- spocc::occ(from = "ebird",
                     ebirdopts = list(method = "ebirdgeo",
                                      species = NULL,
                                      lng = center["x.x1"],
                                      lat = center["y.y1"],
                                      back = 30,
                                      dist = as.numeric(
                                        units::set_units(dist, "km"))))
                     
mapr::map_leaflet(birds2)

mapr leaflet map of observation locations

Quite handy!

Now, let’s explore the whole eBird dataset for Germany.

Using auk to process EBD dataset for Germany

After getting access to a custom dataset corresponding to the EBD for Germany only, I used auk’s documentation and this post to learn how to process it. Since I wasn’t planning on zero-filling the data to get presence/absence counts, I was able to ignore the sampling event data that contains the checklist-level information (e.g. time and date, location, and search effort information). For an example of a more advanced auk workflow involving the full EBD, and sampling data, refer to Matt Strimas-Mackey’s own blog post about his package.

Preparing the dataset

Here, the workflow is to clean the data and to filter it using one of auk’s built-in filters and then polygon filtering as earlier in this post. All steps are quite fast, because the custom dataset for Germany isn’t too big (a few hundred megabytes).

Cleaning happens in the following:

ebd_dir <- "C:/Users/Maelle/Documents/ropensci/ebird"

f <- file.path(ebd_dir, "ebd_DE_relMay-2018.txt")
f_clean <- file.path(ebd_dir, "ebd_DE_relMay-2018_clean.txt")
auk::auk_clean(f, f_out = f_clean, remove_text = TRUE)

Then one can filter the data. Note that the auk_extent function, which only retains observations within a bounding box, has been renamed auk_bbox in the dev version of auk; the old name will be deprecated soon.

ebd_dir <- "C:/Users/Maelle/Documents/ropensci/ebird"
f_in_ebd <- file.path(ebd_dir, "ebd_DE_relMay-2018_clean.txt")

library("magrittr")
landkreis_konstanz_coords <- sf::st_coordinates(landkreis_konstanz)

ebd_filter <- auk::auk_ebd(f_in_ebd) %>% 
  auk::auk_extent(c(min(landkreis_konstanz_coords[, "X"]),
                    min(landkreis_konstanz_coords[, "Y"]), 
                    max(landkreis_konstanz_coords[, "X"]), 
                    max(landkreis_konstanz_coords[, "Y"])))
ebd_filter
## Input 
##   EBD: C:\Users\Maelle\Documents\ropensci\ebird\ebd_DE_relMay-2018_clean.txt 
## 
## Output 
##   Filters not executed
## 
## Filters 
##   Species: all
##   Countries: all
##   States: all
##   BCRs: all
##   Spatial extent: Lon 8.6 - 9.2; Lat 47.7 - 47.9
##   Date: all
##   Start time: all
##   Last edited date: all
##   Protocol: all
##   Project code: all
##   Duration: all
##   Distance travelled: all
##   Records with breeding codes only: no
##   Complete checklists only: no
fs::dir_create("ebird")
f_out_ebd <- "ebird/ebd_lk_konstanz.txt"
f_out_sampling <- "ebird/ebd_lk_konstanz_sampling.txt"
ebd_filtered <- auk::auk_filter(ebd_filter, file = f_out_ebd,
                                overwrite = TRUE)
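Since auk_extent is on its way out, here is a hedged sketch that builds the extent vector once, so that it can be passed to whichever function the installed auk version provides. The helper bbox_from_coords() is hypothetical, the toy matrix is made up, and the commented-out call assumes that auk_bbox accepts the same numeric vector as auk_extent, which should be checked against the auk documentation.

```r
# Hypothetical helper: turn a coordinate matrix with "X" and "Y" columns
# (as returned by sf::st_coordinates()) into c(xmin, ymin, xmax, ymax)
bbox_from_coords <- function(coords) {
  c(min(coords[, "X"]), min(coords[, "Y"]),
    max(coords[, "X"]), max(coords[, "Y"]))
}

# Toy coordinates standing in for landkreis_konstanz_coords
toy_coords <- cbind(X = c(8.6, 9.2, 8.9), Y = c(47.7, 47.8, 47.9))
extent <- bbox_from_coords(toy_coords)
extent

# With real data (not run here, assumes a recent auk):
# ebd_filter <- auk::auk_ebd(f_in_ebd) %>%
#   auk::auk_bbox(extent)
```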

On top of this filtering with auk, after loading the data we filter observations inside the polygon of the county.

crs <- sf::st_crs(landkreis_konstanz)

ebd <- auk::read_ebd(f_out_ebd) %>%
  sf::st_as_sf(coords = c("longitude", "latitude"), 
                crs = crs) 

in_indices <- sf::st_within(ebd, landkreis_konstanz)

ebd <- dplyr::filter(ebd, lengths(in_indices) > 0)

ebd <- as.data.frame(ebd)

What are the observed birds?

Before looking at species names, let’s have a brief look at the size and temporal extent of the data.

library("ggplot2")

dim(ebd)
## [1] 10156    41
ebd %>%
  dplyr::mutate(year = lubridate::year(observation_date)) %>%
ggplot() +
  geom_bar(aes(year))  +
  hrbrthemes::theme_ipsum(base_size = 12, axis_title_size = 12, axis_text_size = 12) +
  ylab("No. of eBird observations") +
  xlab("Time (years)") +
  ggtitle("Full eBird dataset for the County of Constance")

No. of eBird observations over the years

eBird started in 2002 but only became global in 2010. It allows people to enter older observations, though.

Now we can look at what birds have been reported the most.

ebd %>%
  dplyr::filter(approved) %>%
  dplyr::count(scientific_name, common_name) %>%
  dplyr::arrange(- n) %>%
  head(n = 10) %>%
  knitr::kable()

| scientific_name | common_name | n |
|---|---|---|
| Corvus corone | Carrion Crow | 288 |
| Turdus merula | Eurasian Blackbird | 285 |
| Anas platyrhynchos | Mallard | 273 |
| Fulica atra | Eurasian Coot | 268 |
| Parus major | Great Tit | 266 |
| Podiceps cristatus | Great Crested Grebe | 254 |
| Ardea cinerea | Gray Heron | 236 |
| Cygnus olor | Mute Swan | 234 |
| Cyanistes caeruleus | Eurasian Blue Tit | 233 |
| Chroicocephalus ridibundus | Black-headed Gull | 223 |

I had to google most of them, but only because I didn’t know the scientific and English names of these birds: they’re birds even I, not a birder, know, probably because they’re also common in Brittany where I grew up.

We can also look at birds whose observation was rejected. Out of 10156 observations only 64 were reviewed, and only 5 were not approved.

ebd %>%
  dplyr::select(scientific_name, common_name,
               approved, reviewed, reason) %>%
  dplyr::filter(!approved) %>%
  knitr::kable()

| scientific_name | common_name | approved | reviewed | reason |
|---|---|---|---|---|
| Cygnus atratus | Black Swan | FALSE | TRUE | Species-Introduced/Exotic |
| Cygnus atratus | Black Swan | FALSE | TRUE | Species-Introduced/Exotic |
| Cygnus atratus | Black Swan | FALSE | TRUE | Species-Introduced/Exotic |
| Oxyura leucocephala | White-headed Duck | FALSE | TRUE | Species-Introduced/Exotic |
| Mareca sibilatrix | Chiloe Wigeon | FALSE | TRUE | Species-Introduced/Exotic |

Black Swans are mostly found in Australia; they have been imported to a few other places, where some escaped. eBird mostly doesn’t accept records of exotic species, although this policy is debated. In any case, eBird’s curation of the data entered is quite admirable.
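The review counts quoted earlier in this section come from the logical reviewed and approved columns. Here is a minimal sketch on a made-up data frame; the real ebd object has the same two columns, but the values below are invented.

```r
# Toy stand-in for the ebd data frame
toy_ebd <- data.frame(
  common_name = c("Mallard", "Black Swan", "Mute Swan", "Great Tit"),
  reviewed    = c(FALSE, TRUE, TRUE, FALSE),
  approved    = c(TRUE, FALSE, TRUE, TRUE)
)

n_reviewed <- sum(toy_ebd$reviewed)   # records looked at by a reviewer
n_rejected <- sum(!toy_ebd$approved)  # records that were not approved
c(reviewed = n_reviewed, rejected = n_rejected)
```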

Who observed birds?

In one of his latest blog posts Scott Chamberlain mentioned the legendary Lowell Ahart, super plant collector in Butte County, California. Does the county of Constance have a super birder?

(first_birder <- ebd %>%
  dplyr::count(observer_id) %>%
  dplyr::arrange(- n) %>%
  head(n = 1) )
## # A tibble: 1 x 2
##   observer_id     n
##   <chr>       <int>
## 1 obsr457108   3551
(proportion <- round(first_birder$n/nrow(ebd),
                    digits = 2))
## [1] 0.35

Wow, that person made 35% of eBird observations in the county! The EBD no longer provides names (a consequence of the EU General Data Protection Regulation), but from the checklist ID one can get access to the checklist page, e.g. this one, where the name of the observer is present. The super birder of the County of Constance is Antonio Anta Bink.

Conclusion

R packages for occurrence data

In this post I gave a rough view of what birds are present in the county around Radolfzell: Eurasian Blackbirds, Carrion Crows, Great Tits… but not Black Swans in eBird’s data. We mostly illustrated the use of two R packages accessing eBird’s data:

  • auk for processing the gigantic whole eBird’s dataset.

  • rebird for getting access to recent data via an API. rebird is part of a larger collection of packages for occurrence data within rOpenSci’s suite, with spocc being an umbrella package accessing several data sources; scrubr a helper for cleaning data obtained this way; and mapr a utility package for mapping such data.

Explore these packages, and more of rOpenSci’s suite, by checking out our packages page!

More birding soon!

Stay tuned for the next post in this series, which will mark a break from modern data since we’ll try to extract information from old natural history bird drawings! After that, in a following post, we’ll come back to the occurrence data obtained from eBird in order to complement it with open taxonomic and traits data. In the meantime, happy (e)birding!