rOpenSci | Literature

Literature

Analyze Scientific Papers (and Text in General)

Full Text of Scholarly Articles Across Many Data Sources

Scott Chamberlain
Description

Provides a single interface to many sources of full text scholarly data, including Biomed Central, Public Library of Science, Pubmed Central, eLife, F1000Research, PeerJ, Pensoft, Hindawi, arXiv preprints, and more. Functionality is included for searching for articles, downloading full or partial text, downloading supplementary materials, and converting to various data formats.

Scientific use cases
  1. Bauer, P. C., Barbera, P., & Munzert, S. (2016). The Quality of Citations: Towards Quantifying Qualitative Impact in Social Science Research. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2874549
  2. Piper, A. M., Batovska, J., Cogan, N. O. I., Weiss, J., Cunningham, J. P., Rodoni, B. C., & Blacket, M. J. (2019). Prospects and challenges of implementing DNA metabarcoding for high-throughput insect surveillance. GigaScience, 8(8). https://doi.org/10.1093/gigascience/giz092
  3. Mishra, P., & Narayan Tripathi, L. (2019). Characterization of two‐dimensional materials from Raman spectral data. Journal of Raman Spectroscopy. https://doi.org/10.1002/jrs.5744
  4. Vitale, O., Preste, R., Palmisano, D., & Attimonelli, M. (2019). A data and text mining pipeline to annotate human mitochondrial variants with functional and clinical information. Molecular Genetics & Genomic Medicine, 8(2). https://doi.org/10.1002/mgg3.1085
  5. Joo, R., Picardi, S., Boone, M. E., Clay, T. A., Patrick, S. C., Romero-Romero, V. S., & Basille, M. (2020). A decade of movement ecology. arXiv preprint arXiv:2006.00110 https://arxiv.org/pdf/2006.00110.pdf

Convert Among Citation Formats

Scott Chamberlain
Description

Converts among many citation formats, including BibTeX, Citeproc, Codemeta, RDF XML, RIS, Schema.org, and Citation File Format. A low level R6 class is provided, as well as stand-alone functions for each citation format for both read and write.

patentsview
CRAN Peer-reviewed

An R Client to the PatentsView API

Christopher Baker
Description

Provides functions to simplify the PatentsView API (http://www.patentsview.org/api/doc.html) query language, send GET and POST requests to the API’s seven endpoints, and parse the data that comes back.

medrxivr
CRAN Peer-reviewed

Access and Search MedRxiv and BioRxiv Preprint Data

Luke McGuinness
Description

Preprints, preliminary versions of research articles that have yet to undergo peer review, are an increasingly important source of health-related bibliographic content. The two preprint repositories most relevant to health-related sciences are medRxiv https://www.medrxiv.org/ and bioRxiv https://www.biorxiv.org/, both of which are operated by the Cold Spring Harbor Laboratory. medrxivr provides programmatic access to the Cold Spring Harbor Laboratory (CSHL) API https://api.biorxiv.org/, allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list, etc.) into R. medrxivr also provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria.
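The idea of searching downloaded records with regular expressions and Boolean logic can be sketched in a few lines. The snippet below is an illustration in Python, not medrxivr's API: the record fields, the `matches` and `search` helpers, and the convention that patterns inside a group are OR-ed while the groups themselves are AND-ed are all assumptions made for this example.

```python
import re

def matches(record, query, fields=("title", "abstract")):
    """Return True if every group in `query` has at least one matching pattern.

    `query` is a list of groups; each group is a list of regular expressions.
    Patterns within a group are combined with OR, groups are combined with AND.
    (This convention is an assumption chosen for the sketch.)
    """
    text = " ".join(record.get(field, "") for field in fields)
    return all(
        any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in group)
        for group in query
    )

def search(records, query):
    """Keep only the records that satisfy the Boolean query."""
    return [record for record in records if matches(record, query)]

# Hypothetical preprint metadata, standing in for downloaded records.
records = [
    {"title": "Dementia risk and statin use", "abstract": "A cohort study."},
    {"title": "Statins in cardiovascular disease", "abstract": "A randomised trial."},
    {"title": "Alzheimer's disease biomarkers", "abstract": "No lipid-lowering drugs studied."},
]

# (dementia OR alzheimer) AND (statin OR statins)
query = [[r"dementia", r"alzheimer"], [r"statins?\b"]]
hits = search(records, query)
```

Keeping the query as nested lists means the Boolean structure stays visible in the data rather than being buried inside one large regular expression.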


Interface to the Search API for PLoS Journals

Scott Chamberlain
Description

A programmatic interface to the SOLR-based search API (http://api.plos.org/) provided by the Public Library of Science journals to search their articles. Functions are included for searching for articles, retrieving articles, making plots, doing faceted searches, highlighting searches, and viewing results of highlighted searches in a browser.

Scientific use cases
  1. Hartgerink, C. H. J., van Aert, R. C. M., Nuijten, M. B., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Distributions of p-values smaller than .05 in psychology: what is going on? PeerJ, 4, e1935. https://doi.org/10.7717/peerj.1935
  2. White, E. (2015). Some thoughts on best publishing practices for scientific software. IEE, 8. https://doi.org/10.4033/iee.2015.8.9.c
  3. Gálvez, R. H. (2017). Assessing author self-citation as a mechanism of relevant knowledge diffusion. Scientometrics. https://doi.org/10.1007/s11192-017-2330-1
  4. Li, K., Yan, E., & Feng, Y. (2017). How is R cited in research outputs? Structure, impacts, and citation standard. Journal of Informetrics, 11(4), 989–1002. https://doi.org/10.1016/j.joi.2017.08.003
  5. Federer LM, Belter CW, Joubert DJ, Livinski A, Lu YL, et al. (2018) Data sharing in PLOS ONE: An analysis of Data Availability Statements. PLOS ONE 13(5): e0194768. https://doi.org/10.1371/journal.pone.0194768
  6. Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
  7. Nuijten, M. B. (2018, April 30). Research on Research: A Meta-Scientific Study of Problems and Solutions in Psychological Science. https://doi.org/10.31234/osf.io/qtk7e
  8. Enkhbayar, A., Haustein, S., Barata, G., & Alperin, J. P. (2019). How much research shared on Facebook is hidden from public view? A comparison of public and private online activity around PLOS ONE papers. arXiv preprint arXiv:1909.01476. https://arxiv.org/abs/1909.01476
  9. Mishra, P., & Narayan Tripathi, L. (2019). Characterization of two‐dimensional materials from Raman spectral data. Journal of Raman Spectroscopy. https://doi.org/10.1002/jrs.5744
  10. Vílchez-Román, C., Huamán-Delgado, F., & Alhuay-Quispe, J. (2020). Social dimension activates the usage and academic impact of Open Access publications in Andean countries: a structural modeling-based approach. Information Development, 026666692090184. https://doi.org/10.1177/0266666920901849
  11. Enkhbayar, A., Haustein, S., Barata, G., & Alperin, J. P. (2020). How much research shared on Facebook happens outside of public pages and groups? A comparison of public and private online activity around PLOS ONE papers. Quantitative Science Studies, 1–22. https://doi.org/10.1162/qss_a_00044
lingtypology
CRAN Peer-reviewed

Linguistic Typology and Mapping

George Moroz
Description

Provides R with the Glottolog database https://glottolog.org/ and some additional tools for linguistic mapping. The Glottolog database contains a catalogue of the languages of the world. This package helps researchers make linguistic maps, following the philosophy of the Cross-Linguistic Linked Data project https://clld.org/, which facilitates uniform access to the data across publications. A tutorial for this package is available on GitHub pages https://docs.ropensci.org/lingtypology/ and in the package vignette. Maps created by this package can be used for both investigation and linguistic teaching. In addition, the package can download data from typological databases such as WALS, AUTOTYP, and some others, and can be used to create your own database website.

Scientific use cases
  1. Maisak, T. (2017). Repetitive prefix in Agul: Morphological copy from a closely related language. International Journal of Bilingualism, 136700691774006. https://doi.org/10.1177/1367006917740060
  2. Roettger, T., & Gordon, M. (2017). Methodological issues in the study of word stress correlates. Linguistics Vanguard, 3(1). http://www.linguistics.ucsb.edu/faculty/gordon/Roettger&Gordon_AcousticMethodologoy.pdf
  3. Hantgan-Sonko, A. (2020). Synchronic and diachronic strategies of mora preservation in Gújjolaay Eegimaa. Journal of African Languages and Literatures, (1), 1-25. http://www.politics.unina.it/index.php/jalalit/article/download/6732/7790
  4. Ye, J. (2020). Independent and dependent possessive person forms. Studies in Language, 44(2), 363–406. https://doi.org/10.1075/sl.19020.ye

Client for Various CrossRef APIs

Scott Chamberlain
Description

Client for various CrossRef APIs. It supports metadata search with their old and newer search APIs, getting citations in various formats (including bibtex, citeproc-json, rdf-xml, etc.), converting DOIs to PMIDs and vice versa, getting citations for DOIs, and getting links to full text of articles when available.

Scientific use cases
  1. Jahn, N., & Tullney, M. (2016). A study of institutional spending on open access publication fees in Germany. PeerJ, 4, e2323. https://doi.org/10.7717/peerj.2323
  2. Lammey, R. (2016). Using the Crossref Metadata API to explore publisher content. Sci Ed, 3(2), 109–111. https://doi.org/10.6087/kcse.75
  3. Bauer, P. C., Barbera, P., & Munzert, S. (2016). The Quality of Citations: Towards Quantifying Qualitative Impact in Social Science Research. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2874549
  4. Cho, H., & Yu, Y. (2018). Link prediction for interdisciplinary collaboration via co-authorship network. arXiv preprint arXiv:1803.06249. https://arxiv.org/pdf/1803.06249.pdf
  5. Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
  6. Hicks, D. J., Coil, D. A., Stahmer, C. G., & Eisen, J. A. (2019). Network analysis to evaluate the impact of research funding on research community consolidation. https://doi.org/10.1101/534495
  7. Olsson-Collentine, A., van Assen, M. A. L. M., & Hartgerink, C. H. J. (2019). The Prevalence of Marginally Significant Results in Psychology Over Time. Psychological Science, 095679761983032. https://doi.org/10.1177/0956797619830326
  8. Matthias, L., Jahn, N., & Laakso, M. (2019). The Two-Way Street of Open Access Journal Publishing - Flip It and Reverse It. Publications. 7(2), 23. https://doi.org/10.3390/publications7020023
  9. Mishra, P., & Narayan Tripathi, L. (2019). Characterization of two‐dimensional materials from Raman spectral data. Journal of Raman Spectroscopy. https://doi.org/10.1002/jrs.5744
  10. Fu, D. Y., & Hughey, J. J. (2019). Releasing a preprint is associated with more attention and citations for the peer-reviewed article. eLife, 8. https://doi.org/10.7554/elife.52646
  11. Fraser, N., Momeni, F., Mayr, P., & Peters, I. (2020). The relationship between bioRxiv preprints, citations and altmetrics. Quantitative Science Studies, 1–21. https://doi.org/10.1162/qss_a_00043

Fetch Scholarly Full Text from Crossref

Scott Chamberlain
Description

Text mining client for Crossref (https://crossref.org). Includes functions for getting links to full text of articles, fetching full text articles from those links or Digital Object Identifiers (DOIs), and text extraction from PDFs.


Client for Citoid

Scott Chamberlain
Description

Client for Citoid (https://www.mediawiki.org/wiki/Citoid), an API for getting citations for various scholarly work identifiers found on Wikipedia.

microdemic
CRAN Staff maintained

Microsoft Academic API Client

Scott Chamberlain
Description

The Microsoft Academic Knowledge API provides programmatic access to scholarly articles in the Microsoft Academic Graph (https://academic.microsoft.com/). Includes methods matching all ‘Microsoft Academic’ API routes, including search, graph search, text similarity, and interpretation of natural language query strings.


High-Performance Stemmer, Tokenizer, and Spell Checker

Jeroen Ooms
Description

Low level spell checker and morphological analyzer based on the famous hunspell library https://hunspell.github.io. The package can analyze or check individual words as well as parse text, latex, html or xml documents. For a more user-friendly interface use the spelling package which builds on this package to automate checking of files, documentation and vignettes in all common formats.

Scientific use cases
  1. Cichosz, P. (2018) A case study in text mining of discussion forum posts: classification with bag of words and global vectors Int. J. Appl. Math. Comput. Sci., Vol. 28, No. 4, 787–801. https://www.amcs.uz.zgora.pl/?action=paper&paper=1469
  2. Yeomans, M., Kantor, A., & Tingley, D. (2018). The politeness Package: Detecting Politeness in Natural Language. The R Journal. https://journal.r-project.org/archive/2018/RJ-2018-067/RJ-2018-067.pdf
  3. Lee, A. J., Jones, B. C., & DeBruine, L. M. (2019, January 21). Investigating the association between mating-relevant self-concepts and mate preferences through a data-driven analysis of online personal descriptions. https://doi.org/10.31234/osf.io/38zef
  4. Liu, Crocker H., Nowak, Adam, and Smith, Patrick S. 2018. Does the Asset Pricing Premium Reflect Asymmetric or Incomplete Information? Economics Faculty Working Papers Series. 5. https://researchrepository.wvu.edu/econ_working-papers/5
  5. Nicolas, G., Bai, X., & Fiske, S. T. (2019). Automated Dictionary Creation for Analyzing Text: An Illustration from Stereotype Content. https://psyarxiv.com/afm8k/download?format=pdf
  6. Bayer, D., & Michael, S. (2019). Exploring the Daschle Collection using Text Mining. arXiv preprint arXiv:1904.12623 https://arxiv.org/pdf/1904.12623
  7. Green, E. P., Whitcomb, A., Kahumbura, C., Rosen, J. G., Goyal, S., Achieng, D., & Bellows, B. (2019). What is the best method of family planning for me?: a text mining analysis of messages between users and agents of a digital health service in Kenya. Gates Open Research, 3, 1475. https://doi.org/10.12688/gatesopenres.12999.1
  8. Lin, C., Lou, Y.-S., Tsai, D.-J., Lee, C.-C., Hsu, C.-J., Wu, D.-C., … Fang, W.-H. (2019). Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study. JMIR Medical Informatics, 7(3), e14499. https://doi.org/10.2196/14499
  9. Luc, A., Lê, S., & Philippe, M. (2019). Nudging consumers for relevant data using Free JAR profiling: an application to product development. Food Quality and Preference, 103751. https://doi.org/10.1016/j.foodqual.2019.103751
  10. Ramagopalan, S. V., Malcolm, B., Merinopoulou, E., McDonald, L., & Cox, A. (2019). Automated extraction of treatment patterns from social media posts: an exploratory analysis in renal cell carcinoma. Future Oncology. https://doi.org/10.2217/fon-2019-0406
  11. Cinelli, M., Ficcadenti, V., & Riccioni, J. (2019). The interconnectedness of the economic content in the speeches of the US Presidents. Annals of Operations Research. https://doi.org/10.1007/s10479-019-03372-2
  12. Christensen, A. P., & Kenett, Y. (2019, October 22). Semantic Network Analysis (SemNA): A Tutorial on Preprocessing, Estimating, and Analyzing Semantic Networks. https://doi.org/10.31234/osf.io/eht87
  13. Booth, A., Bell, T., Halhol, S., Pan, S., Welch, V., Merinopoulou, E., … Cox, A. (2019). Using Social Media to Uncover Treatment Experiences and Decisions in Patients With Acute Myeloid Leukemia or Myelodysplastic Syndrome Who Are Ineligible for Intensive Chemotherapy: Patient-Centric Qualitative Data Analysis. Journal of Medical Internet Research, 21(11), e14285. https://doi.org/10.2196/14285
  14. Deng, H., Wang, Q., Turner, D. P., Sexton, K. E., Burns, S. M., Eikermann, M., … Houle, T. T. (2020). Sentiment analysis of real-world migraine tweets for population research. Cephalalgia Reports, 3, 251581631989886. https://doi.org/10.1177/2515816319898867
  15. Cinelli, M. (2019). Generalized rich-club ordering in networks. Journal of Complex Networks, 7(5), 702–719. https://doi.org/10.1093/comnet/cnz002
  16. Funk, B., Sadeh-Sharvit, S., Fitzsimmons-Craft, E. E., Trockel, M. T., Monterubio, G. E., Goel, N. J., … Taylor, C. B. (2020). A Framework for Applying Natural Language Processing in Digital Health Interventions. Journal of Medical Internet Research, 22(2), e13855. https://doi.org/10.2196/13855
  17. Cichosz, P. (2020). Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation. Natural Language Engineering, 1–28. https://doi.org/10.1017/s1351324920000066
  18. Pruchnik, P. (2020). Identification of Trends in the Polish Media on the Example of the Quarterly Studia Medioznawcze The Use of Big Data Tools. Media Studies, 80(1). http://yadda.icm.edu.pl/yadda/element/bwmeta1.element.desklight-e79ed2c7-fd7d-4a91-8895-c322743c8f48/c/04_Pruchnik_EN.pdf
  19. Hamilton, L. M., & Lahne, J. (2020). Fast and automated sensory analysis: Using natural language processing for descriptive lexicon development. Food Quality and Preference, 83, 103926. https://doi.org/10.1016/j.foodqual.2020.103926
  20. DellaPosta, D., & Nee, V. (2020). Emergence of diverse and specialized knowledge in a metropolitan tech cluster. Social Science Research, 86, 102377. https://doi.org/10.1016/j.ssresearch.2019.102377
  21. Geller, J., Davis, S. D., & Peterson, D. (2020, May 23). Sans forgetica is not desirable for learning. https://doi.org/10.31234/osf.io/ku5bz
  22. Morselli, D., Passini, S., & McGarty, C. (2020). Sos Venezuela: an analysis of the anti-Maduro protest movements using Twitter. Social Movement Studies, 1–22. https://doi.org/10.1080/14742837.2020.1770072
  23. Ficcadenti, V., Cerqueti, R., Ausloos, M., & Dhesi, G. (2020). Words ranking and Hirsch index for identifying the core of the hapaxes in political texts. Journal of Informetrics, 14(3), 101054. https://doi.org/10.1016/j.joi.2020.101054

Fetch Sections of XML Scholarly Articles

Scott Chamberlain
Description

Get chunks of XML scholarly articles without having to know how to work with XML. Custom mappers for each publisher and for each article section pull out the information you want. Works with outputs from package fulltext, xml2 package documents, and file paths to XML documents.


Interface to the Orcid.org API

Scott Chamberlain
Description

Client for the Orcid.org API (https://orcid.org/). Functions included for searching for people, searching by DOI, and searching by Orcid ID.


Client for the Open Citations Corpus

Scott Chamberlain
Description

Client for the Open Citations Corpus (http://opencitations.net/). Includes a set of functions for getting one identifier type from another, as well as getting references and citations for a given identifier.


Find Free Versions of Scholarly Publications via Unpaywall

Najko Jahn
Description

This web client interfaces with Unpaywall https://unpaywall.org/products/api, formerly oaDOI, a service that finds free full texts of academic papers by linking DOIs with open access journals and repositories. It provides unified access to various data sources for open access full-text links, including Crossref and the Directory of Open Access Journals (DOAJ). API usage is free and no registration is required.

Scientific use cases
  1. Ashby, M. P. J. (2020, March 6). Three quarters of new criminological knowledge is hidden from policy makers. https://doi.org/10.31235/osf.io/wnq7h

Access Publisher Copyright & Self-Archiving Policies via the SHERPA/RoMEO API

Matthias Grenié
Description

Fetches information from the SHERPA/RoMEO API http://www.sherpa.ac.uk/romeo/apimanual.php, which indexes the policies of journals regarding the archival of scientific manuscripts before and/or after peer review, as well as of formatted manuscripts.

Scientific use cases
  1. Ashby, M. P. J. (2020, March 6). Three quarters of new criminological knowledge is hidden from policy makers. https://doi.org/10.31235/osf.io/wnq7h
europepmc
CRAN Peer-reviewed

R Interface to the Europe PubMed Central RESTful Web Service

Najko Jahn
Description

An R Client for the Europe PubMed Central RESTful Web Service (see https://europepmc.org/RestfulWebService for more information). It gives access to both metadata on life science literature and open access full texts. Europe PMC indexes all PubMed content and other literature sources including Agricola, a bibliographic database of citations to the agricultural literature, or Biological Patents. In addition to bibliographic metadata, the client allows users to fetch citations and reference lists. Links between life-science literature and other EBI databases, including ENA, PDB or ChEMBL are also accessible. No registration or API key is required. See the vignettes for usage examples.


Read Data from JSTOR/DfR

Thomas Klebel
Description

Functions and helpers to import metadata, ngrams and full-texts delivered by Data for Research by JSTOR.


Text Extraction, Rendering and Converting of PDF Documents

Jeroen Ooms
Description

Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.

Scientific use cases
  1. Cole, C. B., Patel, S., French, L., & Knight, J. (2016). Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi. https://doi.org/10.1101/073460
  2. Krotov, V., & Tennyson, M. (2018). Scraping Financial Data from the Web Using R Language. Journal of Emerging Technologies in Accounting. https://doi.org/10.2308/jeta-52063
  3. Iqbal, J. (2019). Managerial Self-Attribution Bias and Banks’ Future Performance: Evidence from Emerging Economies. Journal of Risk and Financial Management, 12(2), 73. https://doi.org/10.3390/jrfm12020073
  4. Hanna, A., & Hanna, L.-A. (2019). Topic Analysis of UK Fitness to Practise Cases: What Lessons Can Be Learnt? Pharmacy, 7(3), 130. https://doi.org/10.3390/pharmacy7030130
  5. Hwang, L. J., Pauloo, R. A., & Carlen, J. (2019). Assessing Impact of Outreach through Software Citation for Community Software in Geodynamics. Computing in Science & Engineering, 1–1. https://doi.org/10.1109/mcse.2019.2940221
  6. Ulibarri, N., & Scott, T. A. (2019). Environmental hazards, rigid institutions, and transformative change: How drought affects the consideration of water and climate impacts in infrastructure management. Global Environmental Change, 59, 102005. https://doi.org/10.1016/j.gloenvcha.2019.102005
  7. Lope, D. J., & Dolgun, A. (2020). Measuring the inequality of accessible trams in Melbourne. Journal of Transport Geography, 83, 102657. https://doi.org/10.1016/j.jtrangeo.2020.102657
  8. Verde Arregoitia, L. D., Teta, P., & D’Elía, G. (2020). Patterns in research and data sharing for the study of form and function in caviomorph rodents. Journal of Mammalogy. https://doi.org/10.1093/jmammal/gyaa002
  9. Hagan, A. K., Pollet, R. M., & Libertucci, J. (2020). Suggestions for Improving Invited Speaker Diversity To Reflect Trainee Diversity. Journal of Microbiology & Biology Education, 21(1). https://doi.org/10.1128/jmbe.v21i1.2105

Extract Text from Rich Text Format (RTF) Documents

Jeroen Ooms
Description

Wraps the unrtf utility to extract text from RTF files. Supports document conversion to HTML, LaTeX or plain text. Output in HTML is recommended because unrtf has limited support for converting between character encodings.


General Purpose OAI-PMH Services Client

Scott Chamberlain
Description

A general purpose client to work with any OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) service. The OAI-PMH protocol is described at http://www.openarchives.org/OAI/openarchivesprotocol.html. Functions are provided to work with the OAI-PMH verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets.
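To make the six verbs concrete, here is a minimal sketch in Python of how an OAI-PMH request URL is assembled. It is not this package's interface: the `oai_url` helper and the `https://example.org/oai` base URL are hypothetical, and no request is actually sent. The required arguments per verb follow the protocol document linked above.

```python
from urllib.parse import urlencode

# Required arguments for each OAI-PMH verb, per the protocol specification.
REQUIRED = {
    "GetRecord": {"identifier", "metadataPrefix"},
    "Identify": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListMetadataFormats": set(),
    "ListRecords": {"metadataPrefix"},
    "ListSets": set(),
}

def oai_url(base_url, verb, **params):
    """Build the GET URL for an OAI-PMH request (hypothetical helper)."""
    if verb not in REQUIRED:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # A resumptionToken is an exclusive argument: when continuing a list
    # request it stands in for the arguments that would otherwise be required.
    if "resumptionToken" not in params:
        missing = REQUIRED[verb] - set(params)
        if missing:
            raise ValueError(f"{verb} requires: {sorted(missing)}")
    return f"{base_url}?{urlencode({'verb': verb, **params})}"

# Hypothetical repository endpoint, used only to show the URL shape.
BASE = "https://example.org/oai"
identify_url = oai_url(BASE, "Identify")
records_url = oai_url(BASE, "ListRecords", metadataPrefix="oai_dc")
```

Validating the arguments before building the URL surfaces mistakes locally instead of as a `badArgument` error from the server.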

Scientific use cases
  1. Peters, I., Kraker, P., Lex, E., Gumpenberger, C., & Gorraiz, J. I. (2017). Zenodo in the Spotlight of Traditional and New Metrics. Frontiers in Research Metrics and Analytics, 2. https://doi.org/10.3389/frma.2017.00013
textreuse
CRAN Peer-reviewed

Detect Text Reuse and Document Similarity

Lincoln Mullen
Description

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
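As a rough illustration of two of the ideas named above, shingled n-grams and a similarity function, the Python sketch below tokenizes two sentences into overlapping word trigrams and compares them with the Jaccard coefficient. It does not use this package; the `shingles` and `jaccard` helpers are hypothetical names, and the whitespace tokenization is a deliberate simplification.

```python
def shingles(text, n=3):
    """Return the set of overlapping word n-grams (shingles) in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

# The two sentences differ in one word, which changes three of the
# seven trigrams in each document.
score = jaccard(shingles(doc_a), shingles(doc_b))
```

A single substituted word affects every shingle that contains it, which is why n-gram overlap drops faster than word overlap and makes shingling sensitive to reused passages rather than shared vocabulary.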

Scientific use cases
  1. Funk, K. R., & Mullen, L. A. (2017). The Spine of American Law: Digital Text Analysis and US Legal Practice. The American Historical Review. https://doi.org/10.1093/ahr/123.1.132
  2. Mullen, L. A., Benoit, K., Keyes, O., Selivanov, D., & Arnold, J. (2018). Fast, Consistent Tokenization of Natural Language Text. Journal of Open Source Software, 3(23), 655. https://doi.org/10.21105/joss.00655
  3. García, F. T., Villalba, L. J. G., Orozco, A. L. S., Ruiz, F. D. A., Juárez, A. A., & Kim, T. H. (2018). Locating similar names through locality sensitive hashing and graph theory. Multimedia Tools and Applications, 1-14. https://link.springer.com/article/10.1007/s11042-018-6375-9
  4. Catalano, J. (2018). Digitally Analyzing the Uneven Ground: Language Borrowing Among Indian Treaties. Current Research in Digital History, 1. https://doi.org/10.31835/crdh.2018.02
  5. Schmidt, B. (2018). Stable random projection: lightweight, general-purpose dimensionality reduction for digitized libraries. Journal of Cultural Analytics. https://doi.org/10.22148/16.025
  6. Sanger, W., & Warin, T. (2019). Dataset of Jaccard similarity indices from 1,597 European political manifestos across 27 countries (1945–2017). Data in Brief, 103907. https://doi.org/10.1016/j.dib.2019.103907
  7. Jaric, I., & Djeric, M. (2019). Curriculum and labor market: Comparative analysis of the curricular outcomes of the study program in sociology at the Faculty of Philosophy, University of Belgrade and the required competences in the labor market. Sociologija, 61(Suppl. 1), 718–741. https://doi.org/10.2298/soc19s1718j
  8. Marple, T. (2020). The social management of complex uncertainty: Central Bank similarity and crisis liquidity swaps at the Federal Reserve. The Review of International Organizations. https://doi.org/10.1007/s11558-020-09378-x
  9. Callaghan, T., Karch, A., & Kroeger, M. (2020). Model State Legislation and Intergovernmental Tensions over the Affordable Care Act, Common Core, and the Second Amendment. Publius: The Journal of Federalism. https://doi.org/10.1093/publius/pjaa012
  10. Vogler, D., Udris, L., & Eisenegger, M. (2020). Measuring Media Content Concentration at a Large Scale Using Automated Text Comparisons. Journalism Studies, 1–20. https://doi.org/10.1080/1461670x.2020.1761865
  11. Vogler, D., & Schäfer, M. S. (2020). Growing Influence of University PR on Science News Coverage? A Longitudinal Automated Content Analysis of University Media Releases and Newspaper Coverage in Switzerland, 2003‒2017. International Journal of Communication, 14, 22. https://ijoc.org/index.php/ijoc/article/download/13498/3113
  12. James, S., Pagliari, S., & Young, K. L. (2020). The internationalization of European financial networks: a quantitative text analysis of EU consultation responses. Review of International Political Economy, 1–28. https://doi.org/10.1080/09692290.2020.1779781

R Interface to Apache Tika

Sasha Goodman
Description

Extract text or metadata from over a thousand file types, using Apache Tika https://tika.apache.org/. Get either plain text or structured XHTML content.

googleLanguageR
CRAN Peer-reviewed

Call Google's Natural Language API, Cloud Translation API, Cloud Speech API, and Cloud Text-to-Speech API

Mark Edmondson
Description

Call Google Cloud machine learning APIs for text and speech tasks. Call the Cloud Translation API https://cloud.google.com/translate/ for detection and translation of text, the Natural Language API https://cloud.google.com/natural-language/ to analyse text for sentiment, entities or syntax, the Cloud Speech API https://cloud.google.com/speech/ to transcribe sound files to text and the Cloud Text-to-Speech API https://cloud.google.com/text-to-speech/ to turn text into sound files.


Client for the DataCite API

Scott Chamberlain
Description

Client for the web service methods provided by DataCite (https://www.datacite.org/), including functions to interface with their RESTful search API. The API is backed by Elasticsearch, allowing expressive queries, including faceting.

Scientific use cases
  1. Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
  2. White, L., & Santy, S. (2018). DataDepsGenerators.jl: making reusing data easy by automatically generating DataDeps.jl registration code. Journal of Open Source Software, 3(31), 921. https://doi.org/10.21105/joss.00921
refsplitr
Peer-reviewed

Author Name Disambiguation, Author Georeferencing, and Mapping of Coauthorship Networks with Web of Science Data

Emilio Bruna
Description

Tools to parse and organize reference records downloaded from the Web of Science citation database into an R-friendly format, disambiguate the names of authors, geocode their locations, and generate/visualize coauthorship networks. This package has been peer-reviewed by rOpenSci (v. 1.0).


Citation Style Language (CSL) Utilities

Scott Chamberlain
Description

Tools for working with the Citation Style Language (CSL) (https://citationstyles.org), an XML-based format describing the formatting of citations, notes and bibliographies. Functions are included for downloading and searching for styles and locales, and loading and parsing styles and locales. seasl aims to help users fetch and modify CSL files for work combining code and writing that requires citations.


Google's Compact Language Detector 3

Jeroen Ooms
Description

Google’s Compact Language Detector 3 is a neural network model for language identification and the successor of cld2 (available from CRAN). The algorithm is still experimental and takes a novel approach to language detection with different properties and outcomes. It can be useful to combine this with the Bayesian classifier results from cld2. See https://github.com/google/cld3#readme for more information.

tif

Text Interchange Format

Taylor Arnold
Description

Provides validation functions for common interchange formats for representing text data in R. Includes formats for corpus objects, document term matrices, and tokens. Other annotations can be stored by overloading the tokens structure.

tidypmc
CRAN

Parse Full Text XML Documents from PubMed Central

Chris Stubben
Description

Parse XML documents from the Open Access subset of Europe PubMed Central https://europepmc.org including section paragraphs, tables, captions and references.

refimpact
Peer-reviewed

API Wrapper for the UK REF 2014 Impact Case Studies Database

Perry Stephenson
Description

Provides wrapper functions around the UK Research Excellence Framework 2014 Impact Case Studies Database API http://impact.ref.ac.uk/. The database contains relevant publication and research metadata about each case study as well as several paragraphs of text from the case study submissions. Case studies in the database are licenced under a CC-BY 4.0 licence http://creativecommons.org/licenses/by/4.0/legalcode.

rAltmetric
CRAN Staff maintained

Retrieves Altmetrics Data for Any Published Paper from Altmetric.com

Karthik Ram
Description

Provides a programmatic interface to the citation information and alternate metrics provided by Altmetric. Data from Altmetric allows researchers to immediately track the impact of their published work, without having to wait for citations. This allows for faster engagement with the audience interested in your work. For more information, visit https://www.altmetric.com/.

Scientific use cases
  1. Madden, K., Evaniew, N., Scott, T., Domazetoska, E., Dosanjh, P., Li, C. S., … Sprague, S. (2016). Knowledge Dissemination of Intimate Partner Violence Intervention Studies Measured Using Alternative Metrics: Results From a Scoping Review. Journal of Interpersonal Violence. https://doi.org/10.1177/0886260516657914
  2. Na, J.-C., & Ye, Y. E. (2017). Content Analysis of Scholarly Discussions of Psychological Academic Articles on Facebook. Online Information Review, 41(3). https://doi.org/10.1108/oir-02-2016-0058
  3. Ruano, J., Aguilar-Luque, M., Gómez-Garcia, F., Alcalde Mellado, P., Gay-Mimbrera, J., Carmona-Fernandez, P. J., … Isla-Tejera, B. (2018). The differential impact of scientific quality, bibliometric factors, and social media activity on the influence of systematic reviews and meta-analyses about psoriasis. PLOS ONE, 13(1), e0191124. https://doi.org/10.1371/journal.pone.0191124
  4. Nabout, J. C., Teresa, F. B., Machado, K. B., do Prado, V. H. M., Bini, L. M., & Diniz-Filho, J. A. F. (2018). Do traditional scientometric indicators predict social media activity on scientific knowledge? An analysis of the ecological literature. Scientometrics. https://doi.org/10.1007/s11192-018-2678-x
  5. Araujo, R. F., & Alves, M. (2018). The altmetric performance of publications authored by Brazilian researchers: analysis of CNPq productivity scholarship holders. arXiv preprint arXiv:1807.06366. https://arxiv.org/abs/1807.06366
  6. Sun, Z., Cang, J., Ruan, Y., & Zhu, D. (2019). Reporting gaps between news media and scientific papers on outdoor air pollution–related health outcomes: A content analysis. The International Journal of Health Planning and Management. https://doi.org/10.1002/hpm.2894
  7. Fu, D. Y., & Hughey, J. J. (2019). Releasing a preprint is associated with more attention and citations for the peer-reviewed article. eLife, 8. https://doi.org/10.7554/elife.52646
View Documentation
aRxiv
CRAN

Interface to the arXiv API

Karl Broman
Description

An interface to the API for arXiv (https://arxiv.org), a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics.
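
For readers curious about what such an interface wraps, here is a minimal sketch of building a request URL for the public arXiv API. It is not the aRxiv package's code: the helper name `build_arxiv_query` is hypothetical, and the sketch assumes the documented `search_query`, `start`, and `max_results` parameters of the `export.arxiv.org/api/query` endpoint. No network request is made.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_query(search_query, start=0, max_results=10):
    """Build a query URL for the arXiv API (hypothetical helper)."""
    params = {
        "search_query": search_query,  # e.g. "cat:stat.ML" for a subject category
        "start": start,                # zero-based offset into the result set
        "max_results": max_results,    # number of records to return
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_arxiv_query("cat:stat.ML", max_results=5)
print(url)
```

Fetching this URL returns an Atom XML feed, which a client then parses into records.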

Scientific use cases
  1. Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
View Documentation

Google's Compact Language Detector 2

Jeroen Ooms
Description

Bindings to Google’s C++ library Compact Language Detector 2 (see https://github.com/cld2owners/cld2#readme for more information). Probabilistically detects over 80 languages in plain text or HTML. For mixed-language input it returns the top three detected languages and their approximate proportion of the total classified text bytes (e.g. 80% English and 20% French out of 1000 bytes). There is also a cld3 package on CRAN which uses a neural network model instead.
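
The proportions in the example above are simple shares of the classified bytes. The sketch below reproduces only that arithmetic, assuming per-language byte counts are already known. It does not perform language detection, and the function name and input dictionary are hypothetical.

```python
from collections import Counter


def top_language_proportions(byte_counts, top_n=3):
    """Return the top_n languages with their share of classified bytes."""
    total = sum(byte_counts.values())
    if total == 0:
        return []
    ranked = Counter(byte_counts).most_common(top_n)
    return [(language, count / total) for language, count in ranked]


# Hypothetical byte counts for a 1000-byte mixed-language document.
counts = {"ENGLISH": 800, "FRENCH": 200}
result = top_language_proportions(counts)
print(result)
```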

Scientific use cases
  1. Martín-Martín, A., Orduna-Malea, E., Thelwall, M., & López-Cózar, E. D. (2018). Google Scholar, Web of Science, and Scopus: a systematic comparison of citations in 252 subject categories. arXiv preprint arXiv:1808.05053 https://arxiv.org/abs/1808.05053
  2. Albrecht, U.-V., Hasenfuß, G., & von Jan, U. (2018). Description of Cardiological Apps From the German App Store: Semiautomated Retrospective App Store Analysis. JMIR mHealth and uHealth, 6(11), e11753. https://doi.org/10.2196/11753
  3. Green, E. P., Whitcomb, A., Kahumbura, C., Rosen, J. G., Goyal, S., Achieng, D., & Bellows, B. (2019). What is the best method of family planning for me?: a text mining analysis of messages between users and agents of a digital health service in Kenya. Gates Open Research, 3, 1475. https://doi.org/10.12688/gatesopenres.12999.1
  4. Jaric, I., & Djeric, M. (2019). Curriculum and labor market: Comparative analysis of the curricular outcomes of the study program in sociology at the Faculty of Philosophy, University of Belgrade and the required competences in the labor market. Sociologija, 61(Suppl. 1), 718–741. https://doi.org/10.2298/soc19s1718j
View Documentation

Split, Combine and Compress PDF Files

Jeroen Ooms
Description

Content-preserving transformations of PDF files such as split, combine, and compress. This package interfaces directly to the qpdf C++ API and does not require any command line utilities. Note that qpdf does not read actual content from PDF files: to extract text and data you need the pdftools package.

View Documentation

Extract Text from Microsoft Word Documents

Jeroen Ooms
Description

Wraps the AntiWord utility to extract text from Microsoft Word documents. The utility only supports the old doc format, not the newer XML-based docx format. Use the xml2 package to read the latter.

View Documentation
tabulizer
CRAN Peer-reviewed

Bindings for Tabula PDF Table Extractor Library

Tom Paskhalis
Description

Bindings for the Tabula http://tabula.technology/ Java library, which can extract tables from PDF documents. The tabulizerjars package https://github.com/ropensci/tabulizerjars provides versioned Java .jar files, including all dependencies, aligned to releases of Tabula.

Scientific use cases
  1. Baquero, O. S., & Machado, G. (2018). Spatiotemporal dynamics and risk factors for human Leptospirosis in Brazil. Scientific Reports, 8(1). https://doi.org/10.1038/s41598-018-33381-3
  2. Prats, J., & Danis, P.-A. (2019). An epilimnion and hypolimnion temperature model based on air temperature and lake characteristics. Knowledge & Management of Aquatic Ecosystems, (420), 8. https://doi.org/10.1051/kmae/2019001
View Documentation
IEEER

Interface to the IEEE Xplore Gateway

Saul Wiggin
Description

An interface to the IEEE Xplore Gateway, for searching IEEE publications.

View Documentation