January 27, 2014

solr - an R interface to Solr

By Scott Chamberlain

A number of the APIs we interact with (e.g., the PLOS full text API and USGS’s BISON API, in rplos and rbison, respectively) expose Solr endpoints. Solr is a powerful open-source search server and an Apache-hosted project. Because at least two of our data providers, and possibly more in the future, provide Solr endpoints, it made sense to create an R package with robust functions that work across any Solr endpoint. This is useful to us, and hopefully to others.

The following examples cover some of the things you can do in Solr, which fall into six categories:

Search: via solr_search
Grouping: via solr_group
Faceting: via solr_facet
Highlighting: via solr_highlight
Stats: via solr_stats
More like this: via solr_mlt

The solr package generally takes two steps for any query: a) send the request given your inputs, and b) parse the output into a useful R data structure. Part a) is quite easy; part b) is harder. We are working on making the parsers as general as possible for each of the data formats returned by group, facet, highlight, etc., but they will still fail in some cases. Please submit bug reports to our issue tracker so we can make the parsers work better.
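Step a) can be sketched in a few lines of base R. The helper below, build_solr_url, is hypothetical and not part of the solr package; it only illustrates how a set of Solr parameters might become a request URL, assuming the endpoint accepts ordinary query-string parameters.

```r
# Hypothetical helper (not part of the solr package): turn a named list of
# Solr parameters into a request URL.
build_solr_url <- function(base, params) {
  # Drop parameters the caller left unset (NULL)
  params <- params[!vapply(params, is.null, logical(1))]
  # Percent-encode each value so characters such as ':' and '"' are URL-safe
  values <- vapply(
    params,
    function(x) utils::URLencode(as.character(x), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(names(params), values, sep = "=", collapse = "&"))
}

request_url <- build_solr_url(
  "https://api.plos.org/search",
  list(q = "*:*", rows = 2, fl = "id", wt = "json", fq = NULL)
)
print(request_url)
```

The unset fq parameter is dropped, and the remaining values are encoded before being joined with "&".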

Installation

solr is on CRAN, so you can install the more stable version, along with its dependencies, from there.

install.packages("solr")

You can install the development version from GitHub as follows. Below we’ll use the GitHub version; most of what follows is also available in the CRAN version, except solr_group.

install.packages("devtools")
devtools::install_github("ropensci/solr")

Load the library

library("solr")

Define url endpoint and key

As solr is a general interface to Solr endpoints, you need to define the url. Here, we’ll work with the Public Library of Science full text search API (docs here). Some Solr endpoints require authentication. Note that we don’t yet handle authentication schemes other than passing a key in the url, but that’s on the to-do list.

url <- 'https://api.plos.org/search'

Search

solr_search(q='*:*', rows=2, fl='id', base=url)
#>                                                              id
#> 1       10.1371/annotation/c313df3a-52bd-4cbe-af14-6676480d1a43
#> 2 10.1371/annotation/c313df3a-52bd-4cbe-af14-6676480d1a43/title

Search for words “sports” and “alcohol” within seven words of each other

solr_search(q='everything:"sports alcohol"~7', fl='title', rows=3, base=url)
#>                                                                                                                                                                         title
#> 1                                      Alcohol Ingestion Impairs Maximal Post-Exercise Rates of Myofibrillar Protein Synthesis following a Single Bout of Concurrent Training
#> 2 “Like Throwing a Bowling Ball at a Battle Ship” Audience Responses to Australian News Stories about Alcohol Pricing and Promotion Policies: A Qualitative Focus Group Study
#> 3                                            Development and Validation of a Risk Score Predicting Substantial Weight Gain over 5 Years in Middle-Aged European Men and Women
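The `"term1 term2"~N` form used above is Lucene's proximity search syntax. As a small sketch, the hypothetical helper below (not a solr package function) assembles such a query string from its parts, which can help when the terms or the distance come from variables.

```r
# Hypothetical helper: build a proximity query for a field, a set of terms,
# and a maximum word distance.
proximity_query <- function(field, terms, distance) {
  sprintf('%s:"%s"~%d', field, paste(terms, collapse = " "), as.integer(distance))
}

prox_q <- proximity_query("everything", c("sports", "alcohol"), 7)
print(prox_q)
```

The resulting string can then be passed as the q argument, as in the call above.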

Groups

Most recent publication by journal

solr_group(q='*:*', group.field='journal', rows=5, group.limit=1, group.sort='publication_date desc', fl='publication_date, score', base=url)
#>       groupValue numFound start     publication_date score
#> 1       plos one   931323     0 2014-11-24T00:00:00Z     1
#> 2  plos genetics    40603     0 2014-11-20T00:00:00Z     1
#> 3  plos medicine    18514     0 2014-11-18T00:00:00Z     1
#> 4 plos pathogens    35497     0 2014-11-24T00:00:00Z     1
#> 5   plos biology    26133     0 2014-11-18T00:00:00Z     1

First publication by journal

solr_group(q='*:*', group.field='journal', group.limit=1, group.sort='publication_date asc', fl='publication_date, score', fq="publication_date:[1900-01-01T00:00:00Z TO *]", base=url)
#>                          groupValue numFound start     publication_date
#> 1                          plos one   931323     0 2006-12-01T00:00:00Z
#> 2                     plos genetics    40603     0 2005-06-17T00:00:00Z
#> 3                     plos medicine    18514     0 2004-09-07T00:00:00Z
#> 4                    plos pathogens    35497     0 2005-07-22T00:00:00Z
#> 5                      plos biology    26133     0 2003-08-18T00:00:00Z
#> 6                              none    57566     0 2005-08-23T00:00:00Z
#> 7        plos computational biology    29838     0 2005-06-24T00:00:00Z
#> 8  plos neglected tropical diseases    25119     0 2007-08-30T00:00:00Z
#> 9              plos clinical trials      521     0 2006-04-21T00:00:00Z
#> 10                     plos medicin        9     0 2012-04-17T00:00:00Z
#>    score
#> 1      1
#> 2      1
#> 3      1
#> 4      1
#> 5      1
#> 6      1
#> 7      1
#> 8      1
#> 9      1
#> 10     1

Facet

solr_facet(q='*:*', facet.field='journal', facet.query='cell,bird', base=url)
#> $facet_queries
#>        term value
#> 1 cell,bird    17
#>
#> $facet_fields
#> $facet_fields$journal
#>                                 X1     X2
#> 1                         plos one 931323
#> 2                    plos genetics  40603
#> 3                   plos pathogens  35497
#> 4       plos computational biology  29838
#> 5                     plos biology  26133
#> 6 plos neglected tropical diseases  25119
#> 7                    plos medicine  18514
#> 8             plos clinical trials    521
#> 9                     plos medicin      9
#>
#>
#> $facet_dates
#> NULL
#>
#> $facet_ranges
#> NULL
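To give a feel for the parsing step, here is a minimal sketch of how a facet field could be turned into a two-column data frame. It assumes the response uses Solr's default JSON layout for facet_fields, a flat list alternating between term and count. facet_to_df is a hypothetical name and this is not the package's internal parser.

```r
# Example input shaped like a flat facet_fields entry: term, count, term, count, ...
# (values copied from the output above)
flat <- list("plos one", 931323, "plos genetics", 40603, "plos pathogens", 35497)

# Hypothetical parser, not the solr package's own code
facet_to_df <- function(x) {
  stopifnot(length(x) %% 2 == 0)
  odd <- seq_len(length(x) / 2) * 2 - 1   # positions of the terms
  data.frame(
    term  = vapply(x[odd], as.character, character(1)),
    count = vapply(x[odd + 1], as.numeric, numeric(1)),
    stringsAsFactors = FALSE
  )
}

journal_counts <- facet_to_df(flat)
print(journal_counts)
```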

Range faceting

head( solr_facet(q='*:*', base=url, facet.range='alm_twitterCount', facet.range.start=5, facet.range.end=1000, facet.range.gap=10)$facet_ranges$alm_twitterCount )
#>   X1    X2
#> 1  5 60938
#> 2 15 13668
#> 3 25  6379
#> 4 35  2952
#> 5 45  2297
#> 6 55  1497
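The first column above holds the lower bound of each bucket, and the second the number of documents falling in it. The following sketch reproduces that bucketing locally on made-up numbers; range_facet is a hypothetical function, and it assumes each bucket includes its lower bound and excludes its upper bound.

```r
# Hypothetical local illustration of range faceting, not package code
range_facet <- function(x, start, end, gap) {
  lower <- seq(start, end, by = gap)
  lower <- lower[lower < end]                      # bucket lower bounds
  # The last bucket may run past `end`, as in Solr when
  # facet.range.hardend is not set
  breaks <- c(lower, lower[length(lower)] + gap)
  idx <- findInterval(x, breaks)                   # 0 means below `start`
  idx <- idx[idx >= 1 & idx <= length(lower)]      # drop out-of-range values
  data.frame(start = lower, count = tabulate(idx, nbins = length(lower)))
}

tweets <- c(2, 5, 7, 14, 15, 30, 44, 45, 1200)     # made-up values
buckets <- range_facet(tweets, start = 5, end = 1000, gap = 10)
print(head(buckets))
```

Values below the start (2) or beyond the last bucket (1200) are not counted.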

Highlight

solr_highlight(q='alcohol', hl.fl = 'abstract', rows=2, base = url)
#> $`10.1371/journal.pmed.0040151`
#> $`10.1371/journal.pmed.0040151`$abstract
#> [1] "Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting"
#>
#>
#> $`10.1371/journal.pone.0027752`
#> $`10.1371/journal.pone.0027752`$abstract
#> [1] "Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking"

Stats

solr_stats(q='ecology', stats.field='alm_twitterCount', stats.facet=c('journal','volume'), base=url)
#>   min  max count missing    sum sumOfSquares     mean   stddev
#> 1   0 1624 24326       0 113589     19746631 4.669448 28.10656
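The columns are related: mean and stddev can be recomputed from count, sum and sumOfSquares. The check below uses the values printed above and assumes the reported stddev is a sample standard deviation (dividing by count minus one), which matches these numbers.

```r
# Values copied from the solr_stats() output above
n      <- 24326      # count
total  <- 113589     # sum
sum_sq <- 19746631   # sumOfSquares

stats_mean <- total / n
# Sample standard deviation from the running sums
stats_sd <- sqrt((sum_sq - total^2 / n) / (n - 1))

print(c(mean = stats_mean, stddev = stats_sd))
```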

More like this

solr_mlt returns documents similar to those matched by your search.

out <- solr_mlt(q='title:"ecology" AND body:"cell"', mlt.fl='title', mlt.mindf=1, mlt.mintf=1, fl='counter_total_all', rows=5, base=url)
out$docs
#>                             id counter_total_all
#> 1 10.1371/journal.pbio.1001805             10102
#> 2 10.1371/journal.pbio.0020440             16630
#> 3 10.1371/journal.pone.0087217              2922
#> 4 10.1371/journal.pone.0040117              2514
#> 5 10.1371/journal.pone.0072525              1112

Raw data?

You can optionally get back raw JSON or XML from all functions by setting the parameter raw=TRUE. You can then parse it after the fact with solr_parse, or process it as you wish. For example:

(out <- solr_highlight(q='alcohol', hl.fl = 'abstract', rows=2, base = url, raw=TRUE))
#> [1] "{\"response\":{\"numFound\":15301,\"start\":0,\"docs\":[{},{}]},\"highlighting\":{\"10.1371/journal.pmed.0040151\":{\"abstract\":[\"Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting\"]},\"10.1371/journal.pone.0027752\":{\"abstract\":[\"Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking\"]}}}\n"
#> attr(,"class")
#> [1] "sr_high"
#> attr(,"wt")
#> [1] "json"

Then parse

solr_parse(out, 'df')
#>                          names
#> 1 10.1371/journal.pmed.0040151
#> 2 10.1371/journal.pone.0027752
#>                                                                                                    abstract
#> 1   Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting
#> 2 Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking
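If you prefer to process the response yourself, the reshaping involved is modest. The sketch below starts from a nested list shaped like the highlighting element of the JSON above (as you might get after decoding it with a JSON parser of your choice) and flattens it into a data frame. highlight_to_df is a hypothetical helper, not what solr_parse does internally.

```r
# Nested list shaped like the "highlighting" element of the raw JSON above
highlighting <- list(
  "10.1371/journal.pmed.0040151" = list(
    abstract = list("Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting")
  ),
  "10.1371/journal.pone.0027752" = list(
    abstract = list("Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking")
  )
)

# Hypothetical helper: one row per document, one column for the chosen field
highlight_to_df <- function(hl, field) {
  snippets <- vapply(
    hl,
    function(doc) paste(unlist(doc[[field]]), collapse = " ... "),
    character(1)
  )
  df <- data.frame(names = names(hl), value = unname(snippets),
                   stringsAsFactors = FALSE)
  names(df)[2] <- field
  df
}

hl_df <- highlight_to_df(highlighting, "abstract")
print(hl_df$names)
```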

Verbosity

As you may have noticed, each function reports the actual call made to the Solr endpoint, so you know exactly what was submitted to the remote or local Solr instance. You can suppress the message with verbose=FALSE. This message isn’t in the CRAN version.

Advanced: Function Queries

Function Queries allow you to query on actual numeric fields in the Solr database, and do addition, multiplication, etc. on one or many fields to sort results. For example, here, we search on the product of counter_total_all and alm_twitterCount, using a new temporary field “_val_”

solr_search(q='_val_:"product(counter_total_all,alm_twitterCount)"', rows=5, fl='id,title', fq='doc_type:full', base=url)
#>                             id
#> 1 10.1371/journal.pmed.0020124
#> 2 10.1371/journal.pone.0105948
#> 3 10.1371/journal.pone.0046362
#> 4 10.1371/journal.pone.0069841
#> 5 10.1371/journal.pbio.1001535
#>                                                                                                title
#> 1                                                     Why Most Published Research Findings Are False
#> 2 Sliding Rocks on Racetrack Playa, Death Valley National Park: First Observation of Rocks in Motion
#> 3 The Power of Kawaii: Viewing Cute Images Promotes a Careful Behavior and Narrows Attentional Focus
#> 4                            Facebook Use Predicts Declines in Subjective Well-Being in Young Adults
#> 5                                                     An Introduction to Social Media for Scientists
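Because the function query is just a string, it can be assembled programmatically. The helper below is a hypothetical convenience, not part of the solr package; it wraps a function name and its field arguments in the `_val_:"..."` form used above.

```r
# Hypothetical helper: build a _val_ function query from a function name
# and one or more field names.
function_query <- function(fun, ...) {
  sprintf('_val_:"%s(%s)"', fun, paste(c(...), collapse = ","))
}

fq_string <- function_query("product", "counter_total_all", "alm_twitterCount")
print(fq_string)
```

The result can be passed as the q argument to solr_search, as in the call above.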

Here, we search for the papers with the most views (counter_total_all)

solr_search(q='_val_:"max(counter_total_all)"', rows=5, fl='id,counter_total_all', fq='doc_type:full', base=url)
#>                             id counter_total_all
#> 1 10.1371/journal.pmed.0020124           1002083
#> 2 10.1371/journal.pmed.0050045            324559
#> 3 10.1371/journal.pone.0007595            315117
#> 4 10.1371/journal.pone.0033288            305965
#> 5 10.1371/journal.pone.0069841            277609

Or with the most tweets

solr_search(q='_val_:"max(alm_twitterCount)"', rows=5, fl='id,alm_twitterCount', fq='doc_type:full', base=url)
#>                             id alm_twitterCount
#> 1 10.1371/journal.pone.0061981             2298
#> 2 10.1371/journal.pmed.0020124             1700
#> 3 10.1371/journal.pbio.1001535             1624
#> 4 10.1371/journal.pone.0046362             1368
#> 5 10.1371/journal.pmed.1001747             1361