November 8, 2017
Nearly four years ago I wrote on this blog about solr, an R package for working with the Solr database. Since then we’ve refreshed that package as solrium. Since solrium first hit CRAN about two years ago, users have raised a number of issues that required breaking changes. This blog post is therefore about a major version bump in solrium.
Solr is a “search platform”, a NoSQL database in which data is organized into so-called documents: XML/JSON/etc. blobs of text. Documents are nested within either collections or cores (depending on the mode you start Solr in). Solr makes it easy to search for documents, with a huge variety of parameters and a number of different data formats (JSON/XML/CSV). Solr is similar to Elasticsearch (see our Elasticsearch client elastic) and was around before it. In my opinion Solr is harder to set up than Elasticsearch, but I don’t claim to be an expert on either.
all to combine all of those.
Install solrium
install.packages("solrium")
Or get the development version:
devtools::install_github("ropensci/solrium")
library(solrium)
A big change in v1 of solrium is that solr_connect has been replaced by SolrClient. You now create an R6 connection object with SolrClient, and then either call methods on that R6 object or pass the connection object to functions.
By default, SolrClient$new() sets connection details for a Solr instance that’s running on localhost, on port 8983.
(conn <- SolrClient$new())
#> <Solr Client>
#> host: 127.0.0.1
#> path:
#> port: 8983
#> scheme: http
#> errors: simple
#> proxy:
Instantiation does not check that the Solr instance is up; it merely sets connection details. To check whether the instance is up you can, for example, ping a collection (assuming you have one named gettingstarted):
conn$ping("gettingstarted")
#> $responseHeader
#> $responseHeader$zkConnected
#> [1] TRUE
#>
#> $responseHeader$status
#> [1] 0
#>
#> $responseHeader$QTime
#> [1] 163
#>
#> $responseHeader$params
#> $responseHeader$params$q
#> [1] "{!lucene}*:*"
#>
#> $responseHeader$params$distrib
#> [1] "false"
#>
#> $responseHeader$params$df
#> [1] "_text_"
#>
#> $responseHeader$params$rows
#> [1] "10"
#>
#> $responseHeader$params$wt
#> [1] "json"
#>
#> $responseHeader$params$echoParams
#> [1] "all"
#>
#>
#>
#> $status
#> [1] "OK"
A useful tip when connecting to a publicly exposed Solr instance: you likely don’t need to specify a port, so a pattern like this should work to connect to a URL like http://foobar.com/search:
SolrClient$new(host = "foobar.com", path = "search", port = NULL)
If the instance uses SSL, set scheme to "https":
SolrClient$new(host = "foobar.com", path = "search", port = NULL, scheme = "https")
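To make the relationship between these settings and the resulting URL concrete, here is a minimal sketch in base R. The build_base_url function is a hypothetical helper written for illustration, not part of solrium, and it only assumes that scheme, host, port, and path are combined in the usual URL order.

```r
# Hypothetical helper (not part of solrium): shows how the four connection
# settings could combine into a base URL.
build_base_url <- function(host = "127.0.0.1", path = NULL, port = 8983,
                           scheme = "http") {
  url <- paste0(scheme, "://", host)
  # Only append the port when one is given; port = NULL drops it entirely
  if (!is.null(port)) url <- paste0(url, ":", port)
  # Only append the path when it is a non-empty string
  if (!is.null(path) && nzchar(path)) url <- paste0(url, "/", path)
  url
}

build_base_url()
build_base_url(host = "foobar.com", path = "search", port = NULL)
build_base_url(host = "foobar.com", path = "search", port = NULL, scheme = "https")
```

With the defaults this gives a localhost URL on port 8983, while setting port = NULL produces a URL of the form http://foobar.com/search.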
Another big change in the package is that we wanted to make it easy to determine whether your Solr query gets passed as query parameters in a GET request or as the body of a POST request. Solr clients in some other languages do this, and it made sense to port that idea over here. Now you pass your key-value pairs to either params or body. If nothing is passed to body, we do a GET request. If something is passed to body, we do a POST request, even if there are also key-value pairs passed to params.
This change does break the interface we had in the old version, but we think it’s worth it.
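The GET versus POST rule described above can be sketched in a few lines of base R. The choose_http_method function is a hypothetical illustration of the rule, not solrium’s internal code, and it assumes that “nothing passed to body” means body is NULL.

```r
# Hypothetical helper (not solrium's internals): picks the HTTP method
# from the params/body arguments following the rule described in the post.
choose_http_method <- function(params = NULL, body = NULL) {
  # body decides the method; params alone never triggers a POST
  if (is.null(body)) "GET" else "POST"
}

choose_http_method(params = list(q = "*:*"))
choose_http_method(body = list(q = "*:*"))
choose_http_method(params = list(wt = "json"), body = list(q = "*:*"))
```

The last call illustrates the precedence: once body is supplied, the request is a POST even though params is also present.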
For example, to do a search you have to pass the collection name and a list of named parameters:
conn$search(name = "gettingstarted", params = list(q = "*:*"))
#> # A tibble: 5 x 5
#> id title title_str `_version_` price
#> <chr> <chr> <chr> <dbl> <int>
#> 1 10 adfadsf adfadsf 1.582913e+18 NA
#> 2 12 though though 1.582913e+18 NA
#> 3 14 animals animals 1.582913e+18 NA
#> 4 1 <NA> <NA> 1.582913e+18 100
#> 5 2 <NA> <NA> 1.582913e+18 500
You can instead pass the connection object to solr_search:
solr_search(conn, name = "gettingstarted", params = list(q = "*:*"))
#> # A tibble: 5 x 5
#> id title title_str `_version_` price
#> <chr> <chr> <chr> <dbl> <int>
#> 1 10 adfadsf adfadsf 1.582913e+18 NA
#> 2 12 though though 1.582913e+18 NA
#> 3 14 animals animals 1.582913e+18 NA
#> 4 1 <NA> <NA> 1.582913e+18 100
#> 5 2 <NA> <NA> 1.582913e+18 500
And the same pattern applies for the other functions:
solr_facet
solr_group
solr_mlt
solr_highlight
solr_stats
solr_all
A user requested the ability to do atomic updates: partial updates to documents without having to re-index the entire document.
Two functions were added: update_atomic_json and update_atomic_xml, for JSON- and XML-based updates respectively. Check out their help pages for usage.
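For a sense of what an atomic update looks like on the Solr side, here is a small base R sketch that builds a JSON payload using Solr’s "set" modifier, which replaces the value of a single field. The atomic_set_payload helper, the document id, and the field name are hypothetical; how the payload is handed to update_atomic_json is deliberately left to that function’s help page.

```r
# Hypothetical helper: build a Solr atomic-update payload that sets one
# numeric field on one document, leaving its other fields untouched.
atomic_set_payload <- function(id, field, value) {
  sprintf('[{"id":"%s","%s":{"set":%s}}]', id, field, format(value))
}

# Illustrative values only: set price to 150 on the document with id "1"
payload <- atomic_set_payload("1", "price", 150)
cat(payload, "\n")
```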
solr_search and solr_all in v1 gain attributes that include numFound, start, and maxScore. That is, you can get to these three values after data is returned. Note that some Solr instances may not return all three values.
For example, let’s use the Public Library of Science Solr search instance at https://api.plos.org/search:
plos <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL)
Search
res <- plos$search(params = list(q = "*:*"))
Get attributes
attr(res, "numFound")
#> [1] 1902279
attr(res, "start")
#> [1] 0
attr(res, "maxScore")
#> [1] 1
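Because some instances may not return all three values, it can help to read the attributes defensively. The sketch below uses a made-up data frame in place of a real search result, and it assumes that a value the server did not return shows up as an absent attribute; get_meta is a hypothetical helper, not a solrium function.

```r
# Mock result standing in for a search response: a made-up data frame
# carrying numFound and start, but deliberately no maxScore.
res <- structure(
  data.frame(id = c("10", "12"), stringsAsFactors = FALSE),
  numFound = 2L,
  start = 0L
)

# Hypothetical helper: return the attribute if present, else a default
get_meta <- function(x, name, default = NA) {
  val <- attr(x, name, exact = TRUE)
  if (is.null(val)) default else val
}

get_meta(res, "numFound")
get_meta(res, "start")
get_meta(res, "maxScore")
```

In base R, attr() returns NULL for an attribute that is not set, so the helper falls back to the default for maxScore here.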
A user highlighted that there’s a performance penalty when asking for too many rows. The resulting change in solrium is that in some search functions we automatically adjust the rows parameter to avoid the performance penalty.
I maintain 4 other packages that use solrium: rplos, ritis, rdatacite, and rdryad. If you are interested in using solrium in your package, looking at any of those four packages will give a good sense of how to do it.
The solr package will soon be archived on CRAN. We’ve moved all packages depending on it to solrium. Let me know ASAP if you have any complaints about archiving it on CRAN.
Please do upgrade/install solrium v1 and let us know what you think.