May 19, 2020
The COVID-19 pandemic has dramatically impacted all of our lives in a very short period of time. Spring and summer are usually very busy as students prepare to go to the field to engage in various data collection efforts. The pandemic has also disrupted these carefully planned activities as travel is suspended and local and remote field stations have closed indefinitely. A lost field season can be a major setback for a dissertation timeline and students will have to improvise. One promising opportunity to continue research efforts during these unprecedented times is taking advantage of the massive amounts of open scientific data that are freely available. Open data can form the basis of a review, synthesis, or new research.
Inspired by tweets from Ethan White about “PhD research from a distance”, the rOpenSci team did an in-depth exploration of how we provide access to open data. Our goal is to inspire students to find research opportunities with open data and highlight some of the rOpenSci packages that already make programmatic access possible. We also highlight some examples of how specific collections of packages are being used right now in fields as varied as archaeology and climate science.
Data are fundamental to scientific discovery, and building on new discoveries would not be possible without access to data [1].
Although people rarely develop new research entirely on open data, these datasets provide an opportunity to reproduce and validate existing results and to improve models, and they can be combined with other data to generate new syntheses.
The open science movement has been growing for over a decade and all of that interest has surfaced numerous databases and repositories. The growing interest in reproducibility has also led to the creation of a plethora of open source software to access such data.
rOpenSci’s core mission is to develop such tools and to date we have built over 120 robust data-access packages.
These packages provide access to an impressive variety and quantity of data:
eBird offers up 700 million observations, Crossref has 108 million records of scholarly works which include articles and books, Dryad makes available 13 terabytes of data associated with published papers, and GBIF has over 1.3 billion records of species worldwide.
We hope that this post and these tools provide inspiration for you to explore new data sources and research topics.
Many of rOpenSci’s tools are developed by practicing scientists and have strong communities behind them. We invited university faculty from our community of developer-researchers to highlight sources of open data for research in their fields.
Brooke Anderson, Colorado State University
Research on weather and climate, and their impacts on humans and the environment, can draw on numerous excellent open data sources, including many made available through programmatic access to data collected and shared by institutions and monitoring networks. The US Geological Survey is a particularly exciting example, offering not only APIs for accessing their data, but also a full suite of R packages developed and shared through the USGS-R community. rOpenSci’s own rnoaa package provides access to data through a number of the US National Oceanic and Atmospheric Administration’s open data APIs, allowing for fast and convenient access from R to national or worldwide data on, among others, meteorological observations, sea ice, and tides and currents, while the bomrang package offers similar access to data from the Australian Government Bureau of Meteorology. Other rOpenSci packages provide access to weather- and climate-related data from the Iowa Environmental Mesonet (riem), New Zealand’s National Climate Database (clifro), the US National Aeronautics and Space Administration’s Prediction of Worldwide Energy Resource (POWER) dataset (nasapower), the US National Centers for Environmental Information’s Global Surface Summary of the Day (GSOD) dataset (GSODR), the US National Hurricane Center (rrricanes), the Flanders Environment Agency and Flanders Hydraulics Research’s waterinfo.be dataset (wateRinfo), and Environment and Climate Change Canada (ECCC) (weathercan). bowerbird is a general-purpose package for maintaining local copies of a range of satellite- and model-derived environmental and climate data.
Louise Slater, University of Oxford, Sam Zipper, University of Kansas, Ilaria Prosdocimi, Ca’ Foscari University, Sam Albers, Government of British Columbia, and Claudia Vitolo, European Centre for Medium-Range Weather Forecasts
In hydrology, there has been rapid growth in the number of streamflow data archives made publicly available online by countries such as the UK (rnrfa package), USA (dataRetrieval package), Greece (rOpenSci’s hydroscoper package), and Canada (rOpenSci’s tidyhydat package), although most countries sadly do not yet apply an open policy to their hydrological data. The Task View on Hydrological Data and Modelling and accompanying blog post Getting your toes wet in R: Hydrology, meteorology, and more provide an exciting overview of the most up-to-date R packages that are available for downloading, analysing, and modelling these data. For an overview of the many advantages of using R for hydrological research, see the paper “Using R in Hydrology” [2], which describes approaches to retrieve, analyse, map, model, and visualise hydrological data.
Ben Raymond, Australian Antarctic Division, and Anton Van de Putte, Royal Belgian Institute of Natural Sciences
Antarctic science has a strong culture of open data: the Antarctic Treaty itself states that scientific observations and results from Antarctica should be openly shared, and the Scientific Committee on Antarctic Research has had an active data management group since the late 1980s. To find Antarctic and Southern Ocean data, search the Antarctic Master Directory (metadata catalogue) or portals such as the Antarctic Biodiversity portal or the Southern Ocean Observing System.
The Antarctic rOpenSci community is developing R resources to support Antarctic and Southern Ocean science, with a particular emphasis on simplifying data access and performing common analytical tasks. See this blog post and task view for an overview of some of the packages in development, and the types of analyses that we are aiming to support.
Ben Marwick, University of Washington
Research shuddered to a stop in the Geoarchaeology Lab in early March, with UW being one of the first US campuses to switch to remote work. No longer able to go to campus, we turned our attention to computational text analysis of a large corpus of archaeological conference abstracts to look at questions about gender imbalance and theory change in our field. Our quick pivot to this new area was only possible thanks to high quality and well-documented software such as rOpenSci’s tesseract, pdftools and magick packages. These enabled us to generate data rapidly, giving us more time for exploring and testing hypotheses, and ensuring our students could get to the end of the term ready to share some really interesting results.
We’ve been keeping up with the literature through in-depth study of new journal articles, especially those that include open data. Archaeologists use specialised repositories such as the Digital Archaeological Record (tDAR) and Open Context, as well as several generic repositories, to share data (e.g. Zenodo, Figshare, Dataverse; each of these has R packages to access data). There are R packages for accessing data hosted by those archaeology repositories (tdar, opencontext), but many of our favourite recent articles (we keep a list here) had their data openly archived on the Open Science Framework data repository. While studying these articles we have enjoyed using rOpenSci’s osfr package to quickly and reproducibly access these materials for in-depth exploration. A favourite type of data for many archaeologists is radiocarbon ages, and our group has also been working with these with ease thanks to the c14bazAAR package. We’ve been using this package to get data to study radiocarbon dates from hundreds of archaeological sites in Australia. While we’re missing the lab, rOpenSci’s packages for acquiring archaeological data have been invaluable tools for efficiently enabling us to be active and engaged in our research.
Our task view for archaeological science shows the full range of tools we use, from data acquisition through environmental and geological analysis to writing reproducible manuscripts.
Robin Lovelace, University of Leeds
There has never been a better time for data-driven and reproducible transport research. The COVID-19 pandemic has disrupted transport patterns worldwide. This has led to changes, such as the construction of ‘pop-up’ active transport infrastructure, the prioritisation of which can be supported by reproducible and open data analysis, as outlined in a preprint on the topic (the analysis for which was undertaken in R) [3]. There is a wealth of data out there that can be found with careful search queries, including many new datasets (like Uber’s micromobility datasets, released on May 6th of this year).
For downloading data representing transport networks, I recommend heading to the overpass website and for R users checking out osmdata and the in-development geofabric (to be renamed) R packages.
For open origin-destination data there are many resources, but the PCT package provides a way to access national-scale datasets quickly from the R command line, as outlined in stplanr’s Origin-destination vignette.
For road safety data there is a lack of open data in many countries but you can access national road casualty data, with 60+ variables and 100,000+ records each year with the stats19 package.
For links to additional resources I recommend Chapter 12 of Geocomputation with R and Chapter 11 of QGIS for transport researchers.
For inspiration, I recommend checking out the Propensity to Cycle Tool, an interactive free and open web app that is being used to inform active transport investment plans in dozens of cities across the UK (it also has many data download options at zone, route and route network levels).
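As a rough illustration of the transport-network download mentioned above, the sketch below uses osmdata to query the Overpass API for one class of road in a named place. It assumes the osmdata and sf packages are installed and that the Overpass service is reachable; the helper name `osm_highways` and the example place and tag value are hypothetical choices for illustration, not part of any package.

```r
# Minimal sketch: fetch OpenStreetMap ways of one highway type for a place.
# `osm_highways` is a hypothetical helper; it assumes osmdata (and sf) are
# installed and that the Overpass API can be reached.
osm_highways <- function(place, highway = "cycleway") {
  # Look up a bounding box for the place name and start an Overpass query
  q <- osmdata::opq(bbox = osmdata::getbb(place))
  # Restrict the query to ways tagged highway=<value>
  q <- osmdata::add_osm_feature(q, key = "highway", value = highway)
  # Run the query and return the result as sf objects
  osmdata::osmdata_sf(q)
}

# Example call (needs an internet connection):
# cycleways <- osm_highways("Leeds, UK")
# plot(sf::st_geometry(cycleways$osm_lines))
```

Wrapping the query in a function keeps the place and road type as explicit parameters, which makes it easier to repeat the same download for several cities.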
rOpenSci has its roots in software for biodiversity research, with many packages in the areas of taxonomy, biological occurrences, and natural history/traits.
taxonomy: A good place to start is the taxonomy task view, covering many options for working with online taxonomy data.
occurrences: Occurrence data forms the basis of much ecological research. The largest source of occurrence data, GBIF, can be accessed with the rgbif package. Many more are listed in the README for the package spocc.
natural history/traits: Conservation researchers may want to fetch data from the IUCN Red List via rredlist, Fishbase life history data from rfishbase, bird data from auk or rebird, or trait data from various marine taxa in WoRMS (called “attributes” by WoRMS; worrms).
A good general resource for rOpenSci packages on biodiversity is the rOpenSci Community Call from March 2019: Research Applications of rOpenSci Taxonomy and Biodiversity Tools.
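To give a flavour of what occurrence access looks like in practice, here is a minimal sketch using rgbif. It assumes rgbif is installed and the GBIF API is reachable; the helper name `gbif_occurrences` and the example species are hypothetical choices for illustration.

```r
# Minimal sketch: download a few georeferenced GBIF occurrence records.
# `gbif_occurrences` is a hypothetical helper; it assumes rgbif is installed
# and that the GBIF API can be reached.
gbif_occurrences <- function(species, n = 50) {
  # occ_search() queries the GBIF occurrence API; hasCoordinate = TRUE keeps
  # only records that carry latitude/longitude
  res <- rgbif::occ_search(
    scientificName = species,
    hasCoordinate = TRUE,
    limit = n
  )
  # The occurrence records themselves are in the `data` element
  res$data
}

# Example call (needs an internet connection):
# occ <- gbif_occurrences("Colibri cyanotus", n = 20)
# head(occ[, c("scientificName", "decimalLatitude", "decimalLongitude")])
```

For larger extracts than a quick exploratory pull like this, GBIF’s download service (also exposed through rgbif) is generally the more appropriate route, since it produces a citable DOI for the dataset.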
Browse our table of > 100 data-access packages (under the bird) or jump ahead to see where you come in.
Lesser Violetear Colibri cyanotus. Carlos Sanchez, Macaulay Library | eBird.
The table below shows a subset of our full suite of R packages. You can find scientific use cases for a package on our main page by clicking on a package name.
R package | Data and source | Maintainer |
---|---|---|
antanym | Antarctic geographic names. Composite Gazetteer of Antarctica | Ben Raymond |
AntWeb | Ant data. AntWeb database from the California Academy of Sciences | Karthik Ram |
auk | bird sighting records. http://ebird.org | Matthew Strimas-Mackey |
bikedata | Historic ride data from public hire bicycle systems. London, U.K.; San Francisco CA, New York City NY, Chicago IL, Washington DC, Boston MA, Los Angeles CA, Philadelphia PA, and Minnesota, U.S.A.; Montreal, Canada; and Guadalajara, Mexico. | Mark Padgham |
biomartr | genomic data retrieval. ‘NCBI RefSeq’, ‘NCBI Genbank’, ‘ENSEMBL’, and ‘UniProt’ databases, plus interface to ‘BioMart’ database | Hajk-Georg Drost |
bittrex | Bittrex crypto-currency exchange. https://bittrex.com | Michael Kane |
bold | Bold Systems for genetic barcode data. http://www.boldsystems.org | Scott Chamberlain |
brranching | phylogenetic data. ‘Phylomatic’ http://phylodiversity.net/phylomatic, and ‘Phylocom’ https://github.com/phylocom/phylocom | Scott Chamberlain |
camsRad | Time series of global, direct, and diffuse irradiations on horizontal surface. Copernicus Atmosphere Monitoring Service (CAMS) | Lukas Lundstrom |
ccafs | Climate Change, Agriculture, and Food Security (CCAFS) General Circulation Models. | Scott Chamberlain |
chromer | Chromosome Counts Database. http://ccdb.tau.ac.il | Paula Andrea Martinez |
clifro | New Zealand National Climate Database. https://cliflo.niwa.co.nz | Blake Seers |
comtradr | United Nations Comtrade data. https://comtrade.un.org/data | Chris Muir |
cRegulome | transcription factor/microRNA-gene correlations (co-expression) in cancer. Cistrome Cancer Liu et al. (2011) doi:10.1186/gb-2011-12-8-r83 and ‘miRCancerdb’ databases (in press). | Mahmoud Ahmed |
dbhydroR | South Florida Water Management District’s ‘DBHYDRO’ database. https://www.sfwmd.gov/science-data/dbhydro | Joseph Stachelek |
DoOR.data | Drosophila odorant response data for DoOR.functions. | Daniel Münch |
ecoengine | Georeferenced specimen records from the University of California, Berkeley’s Natural History Museums. https://ecoengine.berkeley.edu | Karthik Ram |
epubr | reading and parsing of internal e-book content from EPUB files. EPUB e-books. | Matthew Leonawicz |
essurvey | European Social Survey data. http://www.europeansocialsurvey.org | Jorge Cimentada |
FedData | Geospatial data from several federated data sources (mainly sources maintained by the US federal government). National Elevation Dataset and National Hydrography Dataset (USGS), the Soil Survey Geographic (SSURGO) database, the Global Historical Climatology Network (GHCN), the Daymet gridded estimates of daily weather parameters, the International Tree Ring Data Bank, and the National Land Cover Database (NLCD). | R. Kyle Bocinsky |
fingertipsR | Data for many indicators of public health in England. http://fingertips.phe.org.uk | Sebastian Fox |
genderdata | Historical datasets of first names and dates of birth. | Lincoln Mullen |
getCRUCLdata | University of East Anglia Climate Research Unit gridded climatology of monthly means. https://crudata.uea.ac.uk/cru/data/hrg/tmc/readme.txt | Adam Sparks |
getlandsat | Landsat 8 Data. https://registry.opendata.aws/landsat-8 | Scott Chamberlain |
GSODR | Global Surface Summary of the Day (GSOD) weather data from USA National Centers for Environmental Information (NCEI). http://www1.ncdc.noaa.gov/pub/data/gsod/readme.txt | Adam Sparks |
gtfsr | public GTFS feeds. | Danton Noriega-Goodwin |
gutenbergr | Project Gutenberg collection. http://www.gutenberg.org | David Robinson |
hathi | HathiTrust bibliographic API. https://www.hathitrust.org | Scott Chamberlain |
hddtools | hydrological data. various data providers | Claudia Vitolo |
helminthR | London Natural History Museum’s host-parasite database. http://www.nhm.ac.uk/research-curation/scientific-resources/taxonomy-systematics/host-parasites | Tad Dallas |
historydata | sample data sets for historians on population, institutional, religious, military, and prosopographical data. | Lincoln Mullen |
hydroscoper | Greek National Data Bank for Hydrological and Meteorological Information. http://www.hydroscope.gr | Konstantinos Vantas |
internetarchive | Internet Archive. https://archive.org/ | Lincoln Mullen |
isdparser | NOAA Integrated Surface Data. https://www.ncdc.noaa.gov/isd | Scott Chamberlain |
jaod | Directory of Open Access Journals. https://doaj.org | Scott Chamberlain |
MODIStsp | time series of rasters from MODIS Satellite Land Products data. | Lorenzo Busetto |
musemeta | museum metadata. Many different museums, including the MET, Getty Museum, and more | Scott Chamberlain |
nasapower | NASA POWER (Prediction Of Worldwide Energy Resource) global meteorology and surface solar energy climatology data. https://power.larc.nasa.gov | Adam H. Sparks |
natserv | NatureServe. https://www.natureserve.org | Scott Chamberlain |
neotoma | paleoecological datasets from the Neotoma Paleoecological Database. http://api.neotomadb.org | Simon J. Goring |
nomisr | UK official statistics from the Nomis database, including data from the Census, the Labour Force Survey, DWP benefit statistics and other economic and demographic data from the Office for National Statistics. https://www.nomisweb.co.uk/api/v01/help | Evan Odell |
onekp | Transcriptomes of over 1000 plant species. The 1000 Plants Initiative (www.onekp.com) | Zebulun Arendsee |
opencontext | Open Context data. https://opencontext.org | Ben Marwick |
originr | Species origin data from multiple sources. Encyclopedia of Life (http://eol.org), ‘Flora Europaea’ (http://rbg-web2.rbge.org.uk/FE/fe.html), Global Invasive Species Database (http://www.iucngisd.org/gisd), the Native Species Resolver (http://bien.nceas.ucsb.edu/bien/tools/nsr/), Integrated Taxonomic Information System (http://www.itis.gov/), and Global Register of Introduced and Invasive Species (http://www.griis.org/). | Scott Chamberlain |
osmdata | OpenStreetMap data. https://openstreetmap.org | Mark Padgham |
ots | Ocean time series datasets, including BATS, HOT, and more. | Scott Chamberlain |
paleobioDB | PaleobioDB fossil data. http://paleobiodb.org/data1.1 | Sara Varela |
pangaear | Pangaea Database. https://www.pangaea.de | Scott Chamberlain |
phylotaR | Orthologous sequence clusters within taxonomic groups from GenBank. https://www.ncbi.nlm.nih.gov/genbank | Dom Bennett |
pleiades | Pleiades data. https://pleiades.stoa.org | Scott Chamberlain |
prism | Oregon State Prism climate data. http://www.prism.oregonstate.edu/ | Alan Butler |
qualtRics | Survey results from the Qualtrics API. https://www.qualtrics.com/about | Julia Silge |
rAvis | proyectoavis database. http://proyectoavis.com | Sara Varela |
rbace | Bielefeld Academic Search Engine (BASE) of more than 150 million scholarly documents from more than 7000 sources. https://www.base-search.net | Scott Chamberlain |
rbhl | Biodiversity Heritage Library (BHL) of digitized literature on biodiversity studies. https://www.biodiversitylibrary.org | Scott Chamberlain |
rbison | USGS BISON database for species occurrence data from the United States. https://bison.usgs.gov | Scott Chamberlain |
rbraries | Libraries.io data from 36 different package managers for programming languages. https://libraries.io/api | Scott Chamberlain |
rcoreoa | CORE API aggregates open access research outputs from repositories and journals. https://core.ac.uk/docs | Scott Chamberlain |
rdatacite | DataCite metadata. https://www.datacite.org | Scott Chamberlain |
rdataretriever | Data Retriever. http://data-retriever.org | Henry Senyondo |
rdefra | DEFRA’s UK-AIR website. https://uk-air.defra.gov.uk | Claudia Vitolo |
rdopa | DOPA (Digital Observatory for Protected Areas) by the European Union Joint Research Centre. | Joona Lehtomaki |
rdryad | Dryad ‘Solr’ data underlying scientific publications. https://datadryad.org | Scott Chamberlain |
rebird | eBird database of bird observations and locations. https://ebird.org/home | Sebastian Pardo |
rentrez | NCBI’s EUtils API for databases like ‘GenBank’ and ‘PubMed’. https://www.ncbi.nlm.nih.gov/genbank https://www.ncbi.nlm.nih.gov/pubmed | David Winter |
rerddap | ERDDAP servers. https://upwell.pfeg.noaa.gov/erddap/information.html | Scott Chamberlain |
rfishbase | Fishbase data on over 30,000 species of fish, their biology, ecology, morphology and more. http://www.fishbase.org http://www.sealifebase.org | Carl Boettiger |
rfisheries | openfisheries.org. http://www.openfisheries.org/ | Karthik Ram |
rfna | Flora of North America website data. http://www.efloras.org | Scott Chamberlain |
rgbif | Global Biodiversity Information Facility (GBIF) data of species occurrence. https://www.gbif.org/developer/summary | Scott Chamberlain |
rglobi | Global Biotic Interactions (GloBI) data on spatial-temporal species interactions. https://www.globalbioticinteractions.org/ | Jorrit Poelen |
rgpdd | Global Population Dynamics Database. https://ecologicaldata.org/wiki/global-population-dynamics-database | Carl Boettiger |
riem | Weather data from Automated Surface Observing System (ASOS) stations. Iowa Environmental Mesonet website. | Maëlle Salmon |
rif | Neuroscience Information Framework (NIF) data. https://neuinfo.org | Scott Chamberlain |
rinat | iNaturalist website of species occurrence data submitted by citizen scientists. http://inaturalist.org | Stéphane Guillou |
rnaturalearthdata | Vector map data. http://www.naturalearthdata.com | Andy South |
rnoaa | Many NOAA data sources including NCDC climate data, and data on sea ice, severe weather, historical metadata, storm and tornado data. https://www.ncdc.noaa.gov/cdo-web/webservices/v2 | Scott Chamberlain |
rnpn | National Phenology Network data on various life history events that occur at specific times. https://usanpn.org | Scott Chamberlain |
ropenaq | air quality data from the OpenAQ platform. https://docs.openaq.org | Maëlle Salmon |
rotl | Open Tree of Life data on phylogenetic trees. https://tree.opentreeoflife.org/ | Francois Michonneau |
rperseus | Perseus Digital Library collection of classical texts. http://cts.perseids.org | David Ranzolin |
rppo | Global Plant Phenology Data Portal. https://www.plantphenology.org | John Deck |
rredlist | IUCN Red List of threatened and endangered species. http://apiv3.iucnredlist.org/api/v3/docs | Scott Chamberlain |
rrricanes | Data on past and current hurricanes and tropical storms for the Atlantic and eastern Pacific oceans. https://www.nhc.noaa.gov/archive/1998/1998archive.shtml | Tim Trice |
rrricanesdata | Storm discussions, forecast/advisories, public advisories, wind speed probabilities, strike probabilities and more. National Hurricane Center | Tim Trice |
rsnps | SNP datasets for SNPs, genotypes, and phenotypes. https://opensnp.org https://www.ncbi.nlm.nih.gov/projects/SNP | Julia Gustavsen |
rusda | United States Department of Agriculture (USDA) data from the Systematic Mycology and Microbiology Laboratory (SMML). | Franz-Sebastian Krah |
rvertnet | VertNet.org archives including taxonomic names, places, and dates. http://vertnet.org | Scott Chamberlain |
rWBclimate | Model predictions from 15 different global circulation models in 20-year intervals. | Edmund Hart |
skynet | air transport statistics from the Bureau of Transportation Statistics (BTS) in the United States. https://www.transtats.bts.gov/databases.asp?Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0 | Filipe Teixeira |
smapr | NASA Soil Moisture Active Passive (SMAP) data. https://smap.jpl.nasa.gov/ | Maxwell Joseph |
solrium | data from Solr. https://lucene.apache.org/solr | Scott Chamberlain |
spocc | species occurrence data sources, including the Global Biodiversity Information Facility (GBIF). | Scott Chamberlain |
suppdata | Supplementary materials from published manuscripts. | William D. Pearse |
tidyhydat | Historical and real-time national hydrometric data from Water Survey of Canada data sources. http://dd.weather.gc.ca/hydrometric/csv http://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www | Sam Albers |
tradestatistics | Access Open Trade Statistics API from R to download international trade data. | Mauricio Vargas |
traits | Species trait data from many different sources, including sequence data from NCBI, plant trait data from BETYdb, plant data from the USDA plants database, data from EOL Traitbank, Coral traits data, Birdlife International, and more. | Scott Chamberlain |
treebase | TreeBASE repository of phylogenetic trees (of species, population, or genes). http://treebase.org | Carl Boettiger |
USAboundaries | Boundaries for geographical units in the United States of America. U.S. Census Bureau, Newberry Library’s ‘Atlas of Historical County Boundaries’ | Lincoln Mullen |
USAboundariesData | Higher resolution boundary data, for use in the USAboundaries package. U.S. Census Bureau, the Newberry Library’s ‘Historical Atlas of U.S. County Boundaries’, and Erik Steiner’s ‘United States Historical City Populations, 1790-2010’. | Lincoln Mullen |
weathercan | Historical weather data from Environment and Climate Change Canada. http://climate.weather.gc.ca/historical_data/search_historic_data_e.html | Steffi LaZerte |
webchem | Chemical information from around the web. | Tamás Stirling |
Have you successfully used one or more of these data sources in your research? We want others to imagine what’s possible by seeing examples. Share your story in the comments and cite your paper or preprint if it’s published.
Is there a data source you want to access programmatically but there’s no R package to do that? Tell us about it in the comments.
Need help? Ask in our discussion forum and we’ll do our best to get you answers.
[1] Tierney, N. J., & Ram, K. (2020). A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility. arXiv preprint arXiv:2002.11626. https://arxiv.org/abs/2002.11626
[2] Slater, L. J., Thirel, G., Harrigan, S., Delaigue, O., Hurley, A., Khouakhi, A., Prosdocimi, I., Vitolo, C., & Smith, K. (2019). Using R in hydrology: a review of recent developments and future directions. Hydrology and Earth System Sciences, 23(7), 2939-2963. https://www.hydrol-earth-syst-sci.net/23/2939/2019/
[3] Lovelace, R., Morgan, M., Talbot, J., & Lucas-Smith, M. (2020, May 11). Methods to prioritise pop-up active transport infrastructure. https://doi.org/10.31219/osf.io/7wjb6