Full Text of Scholarly Articles Across Many Data Sources
Provides a single interface to many sources of full text scholarly data, including BioMed Central, Public Library of Science, PubMed Central, eLife, F1000Research, PeerJ, Pensoft, Hindawi, arXiv preprints, and more. Functionality is included for searching for articles, downloading full or partial text, downloading supplementary materials, and converting to various data formats.
Convert Among Citation Formats
Converts among many citation formats, including BibTeX, Citeproc, Codemeta, RDF XML, RIS, Schema.org, and Citation File Format. A low level R6 class is provided, as well as stand-alone functions for each citation format for both read and write.
An R Client to the PatentsView API
Provides functions to simplify the PatentsView API (http://www.patentsview.org/api/doc.html) query language, send GET and POST requests to the API’s seven endpoints, and parse the data that comes back.
Access and Search MedRxiv and BioRxiv Preprint Data
Preprints, preliminary versions of research articles that have yet to undergo peer review, are an increasingly important source of health-related bibliographic content. The two preprint repositories most relevant to health-related sciences are medRxiv https://www.medrxiv.org/ and bioRxiv https://www.biorxiv.org/, both of which are operated by the Cold Spring Harbor Laboratory. medrxivr provides programmatic access to the Cold Spring Harbor Laboratory (CSHL) API https://api.biorxiv.org/, allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list, etc.) into R. medrxivr also provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria.
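The idea of searching downloaded records with regular expressions and Boolean logic can be illustrated with a minimal, language-neutral sketch in Python. This is not medrxivr's API: the `records` list, the `matches` helper, and the nested-list query shape are all assumptions made for illustration. Patterns inside one topic are combined with OR, and the topics are combined with AND.

```python
import re

# Hypothetical records standing in for downloaded preprint metadata.
records = [
    {"title": "Dementia risk and statin use",
     "abstract": "A cohort study of lipid-lowering drugs."},
    {"title": "Vaccine uptake in adults",
     "abstract": "Survey results on influenza vaccination."},
    {"title": "Statins and cardiovascular outcomes",
     "abstract": "No dementia endpoints were recorded."},
]


def matches(record, topics):
    """Return True when every topic has at least one matching pattern.

    `topics` is a list of lists: patterns within a topic are ORed,
    and the topics themselves are ANDed.
    """
    text = f"{record['title']} {record['abstract']}"
    return all(
        any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in topic)
        for topic in topics
    )


# (dementia OR alzheimer) AND (statin(s) OR lipid-lowering)
query = [[r"dementia", r"alzheimer"], [r"\bstatins?\b", r"lipid[- ]lowering"]]
hits = [r["title"] for r in records if matches(r, query)]
print(hits)
```

The second record is excluded because it satisfies neither topic; the other two are kept because each matches at least one pattern from both topics.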
Interface to the Search API for PLoS Journals
A programmatic interface to the SOLR based search API (http://api.plos.org/) provided by the Public Library of Science journals to search their articles. Functions are included for searching for articles, retrieving articles, making plots, doing faceted searches, highlight searches, and viewing results of highlighted searches in a browser.
Linguistic Typology and Mapping
Provides R with the Glottolog database https://glottolog.org/ and some additional tools for linguistic mapping. The Glottolog database contains a catalogue of the world's languages. This package helps researchers make linguistic maps, following the philosophy of the Cross-Linguistic Linked Data project https://clld.org/, which facilitates uniform access to data across publications. A tutorial for this package is available on GitHub Pages https://docs.ropensci.org/lingtypology/ and in the package vignette. Maps created by this package can be used for both research and linguistics teaching. In addition, the package can download data from typological databases such as WALS, AUTOTYP and some others, and can create your own database website.
Client for Various CrossRef APIs
Client for various CrossRef APIs, including metadata search with their old and newer search APIs, get citations in various formats (including bibtex, citeproc-json, rdf-xml, etc.), convert DOIs to PMIDs, and vice versa, get citations for DOIs, and get links to full text of articles when available.
Fetch Scholarly Full Text from Crossref
Text mining client for Crossref (https://crossref.org). Includes functions for getting links to full text of articles, fetching full text articles from those links or Digital Object Identifiers (DOIs), and text extraction from PDFs.
Client for Citoid
Client for Citoid (https://www.mediawiki.org/wiki/Citoid), an API for getting citations for various scholarly work identifiers found on Wikipedia.
Microsoft Academic API Client
The Microsoft Academic Knowledge API provides programmatic access to scholarly articles in the Microsoft Academic Graph (https://academic.microsoft.com/). Includes methods matching all ‘Microsoft Academic’ API routes, including search, graph search, text similarity, and interpret natural language query string.
High-Performance Stemmer, Tokenizer, and Spell Checker
Low level spell checker and morphological analyzer based on the famous hunspell library https://hunspell.github.io. The package can analyze or check individual words as well as parse text, latex, html or xml documents. For a more user-friendly interface use the spelling package which builds on this package to automate checking of files, documentation and vignettes in all common formats.
Fetch Sections of XML Scholarly Articles
Get chunks of XML scholarly articles without having to know how to work with XML. Custom mappers for each publisher and for each article section pull out the information you want. Works with outputs from package fulltext, xml2 package documents, and file paths to XML documents.
Interface to the Orcid.org API
Client for the Orcid.org API (https://orcid.org/). Functions included for searching for people, searching by DOI, and searching by Orcid ID.
Client for the Open Citations Corpus
Client for the Open Citations Corpus (http://opencitations.net/). Includes a set of functions for getting one identifier type from another, as well as getting references and citations for a given identifier.
Find Free Versions of Scholarly Publications via Unpaywall
This web client interfaces with Unpaywall https://unpaywall.org/products/api, formerly oaDOI, a service that finds free full texts of academic papers by linking DOIs with open access journals and repositories. It provides unified access to various data sources for open access full-text links, including Crossref and the Directory of Open Access Journals (DOAJ). API usage is free and no registration is required.
Access Publisher Copyright & Self-Archiving Policies via the SHERPA/RoMEO API
Fetches information from the SHERPA/RoMEO API http://www.sherpa.ac.uk/romeo/apimanual.php, which indexes journals' policies on archiving scientific manuscripts before and/or after peer review, as well as formatted manuscripts.
R Interface to the Europe PubMed Central RESTful Web Service
An R Client for the Europe PubMed Central RESTful Web Service (see https://europepmc.org/RestfulWebService for more information). It gives access to both metadata on life science literature and open access full texts. Europe PMC indexes all PubMed content and other literature sources including Agricola, a bibliographic database of citations to the agricultural literature, or Biological Patents. In addition to bibliographic metadata, the client allows users to fetch citations and reference lists. Links between life-science literature and other EBI databases, including ENA, PDB or ChEMBL are also accessible. No registration or API key is required. See the vignettes for usage examples.
Read Data from JSTOR/DfR
Functions and helpers to import metadata, ngrams and full-texts delivered by Data for Research by JSTOR.
Text Extraction, Rendering and Converting of PDF Documents
Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.
Extract Text from Rich Text Format (RTF) Documents
Wraps the unrtf utility to extract text from RTF files. Supports document conversion to HTML, LaTeX or plain text. Output in HTML is recommended because unrtf has limited support for converting between character encodings.
General Purpose OAI-PMH Services Client
A general purpose client to work with any OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) service. The OAI-PMH protocol is described at http://www.openarchives.org/OAI/openarchivesprotocol.html. Functions are provided to work with the OAI-PMH verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets.
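As a rough illustration of how the six OAI-PMH verbs map onto HTTP requests, the sketch below builds request URLs in Python. It is not this package's interface: `oai_request_url`, `REQUIRED_ARGS`, and the `https://example.org/oai` endpoint are hypothetical names chosen for the example. The required arguments per verb follow the protocol document linked above.

```python
from urllib.parse import urlencode

# Arguments the protocol requires for each verb when no
# resumptionToken is supplied.
REQUIRED_ARGS = {
    "GetRecord": {"identifier", "metadataPrefix"},
    "Identify": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListMetadataFormats": set(),
    "ListRecords": {"metadataPrefix"},
    "ListSets": set(),
}


def oai_request_url(base_url, verb, **args):
    """Build a GET URL for one OAI-PMH verb (hypothetical helper)."""
    if verb not in REQUIRED_ARGS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # A resumptionToken continues an earlier list request, so the
    # usual required arguments are not checked in that case.
    if "resumptionToken" not in args:
        missing = REQUIRED_ARGS[verb] - set(args)
        if missing:
            raise ValueError(f"{verb} is missing arguments: {sorted(missing)}")
    return f"{base_url}?{urlencode({'verb': verb, **args})}"


# Hypothetical repository endpoint and record identifier.
url = oai_request_url(
    "https://example.org/oai",
    "GetRecord",
    identifier="oai:example.org:123",
    metadataPrefix="oai_dc",
)
print(url)
```

Checking arguments before sending the request is a design choice for the sketch; a repository would otherwise answer with a `badArgument` error in its XML response.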
Detect Text Reuse and Document Similarity
Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
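Three of the techniques named above (shingled n-grams, Jaccard similarity, and minhash) can be sketched in a few lines of Python. This is a generic illustration of the techniques, not the package's implementation: the function names, the use of salted BLAKE2b as the hash family, and the signature length of 64 are all assumptions.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of overlapping word n-grams (shingles) in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def minhash_signature(tokens, num_hashes=64):
    """Keep the minimum hash value per salted hash function.

    `tokens` must be non-empty. Each salt stands in for one
    independent hash function.
    """
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(doc_a, doc_b))
print(estimate_similarity(minhash_signature(doc_a), minhash_signature(doc_b)))
```

The two sentences share 4 of 10 distinct shingles, so the exact Jaccard similarity is 0.4. The minhash estimate approximates that value from fixed-length signatures, which is what makes locality sensitive hashing over large corpora practical.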
R Interface to Apache Tika
Extract text or metadata from over a thousand file types, using Apache Tika https://tika.apache.org/. Get either plain text or structured XHTML content.
Call Google's Natural Language API, Cloud Translation API, Cloud Speech API and Cloud Text-to-Speech API
Call Google Cloud machine learning APIs for text and speech tasks. Call the Cloud Translation API https://cloud.google.com/translate/ for detection and translation of text, the Natural Language API https://cloud.google.com/natural-language/ to analyse text for sentiment, entities or syntax, the Cloud Speech API https://cloud.google.com/speech/ to transcribe sound files to text and the Cloud Text-to-Speech API https://cloud.google.com/text-to-speech/ to turn text into sound files.
Client for the DataCite API
Client for the web service methods provided by DataCite (https://www.datacite.org/), including functions to interface with their RESTful search API. The API is backed by Elasticsearch, allowing expressive queries, including faceting.
Author Name Disambiguation, Author Georeferencing, and Mapping of Coauthorship Networks with Web of Science Data
Tools to parse and organize reference records downloaded from the Web of Science citation database into an R-friendly format, disambiguate the names of authors, geocode their locations, and generate/visualize coauthorship networks. This package has been peer-reviewed by rOpenSci (v. 1.0).
Citation Style Language (CSL) Utilities
Tools for working with the Citation Style Language (CSL) (https://citationstyles.org), an XML-based format describing the formatting of citations, notes and bibliographies. Functions are included for downloading and searching for styles and locales, and loading and parsing styles and locales. seasl aims to help users fetch and modify CSL files for work combining code and writing that requires citations.
Google's Compact Language Detector 3
Google’s Compact Language Detector 3 is a neural network model for language identification and the successor of cld2 (available from CRAN). The algorithm is still experimental and takes a novel approach to language detection with different properties and outcomes. It can be useful to combine this with the Bayesian classifier results from cld2. See https://github.com/google/cld3#readme for more information.
Text Interchange Format
Provides validation functions for common interchange formats for representing text data in R. Includes formats for corpus objects, document term matrices, and tokens. Other annotations can be stored by overloading the tokens structure.
Parse Full Text XML Documents from PubMed Central
Parse XML documents from the Open Access subset of Europe PubMed Central https://europepmc.org including section paragraphs, tables, captions and references.
API Wrapper for the UK REF 2014 Impact Case Studies Database
Provides wrapper functions around the UK Research Excellence Framework 2014 Impact Case Studies Database API http://impact.ref.ac.uk/. The database contains relevant publication and research metadata about each case study as well as several paragraphs of text from the case study submissions. Case studies in the database are licenced under a CC-BY 4.0 licence http://creativecommons.org/licenses/by/4.0/legalcode.
Retrieves Altmetrics Data for Any Published Paper from Altmetric.com
Provides a programmatic interface to the citation information and alternate metrics provided by Altmetric. Data from Altmetric allows researchers to immediately track the impact of their published work, without having to wait for citations. This allows for faster engagement with the audience interested in your work. For more information, visit https://www.altmetric.com/.
Interface to the arXiv API
An interface to the API for arXiv (https://arxiv.org), a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics.
Google's Compact Language Detector 2
Bindings to Google’s C++ library Compact Language Detector 2 (see https://github.com/cld2owners/cld2#readme for more information). Probabilistically detects over 80 languages in plain text or HTML. For mixed-language input it returns the top three detected languages and their approximate proportion of the total classified text bytes (e.g. 80% English and 20% French out of 1000 bytes). There is also a cld3 package on CRAN which uses a neural network model instead.
Split, Combine and Compress PDF Files
Content-preserving transformations of PDF files such as split, combine, and compress. This package interfaces directly to the qpdf C++ API and does not require any command line utilities. Note that qpdf does not read actual content from PDF files: to extract text and data you need the pdftools package.
Extract Text from Microsoft Word Documents
Wraps the AntiWord utility to extract text from Microsoft Word documents. The utility only supports the old doc format, not the new xml based docx format. Use the xml2 package to read the latter.
Bindings for Tabula PDF Table Extractor Library
Bindings for the Tabula http://tabula.technology/ Java library, which can extract tables from PDF documents. The tabulizerjars package https://github.com/ropensci/tabulizerjars provides versioned Java .jar files, including all dependencies, aligned to releases of Tabula.
Interface to the IEEE Xplore Gateway
An interface to the IEEE Xplore Gateway, for searching IEEE publications.