March 18, 2019
Version 7.0.0 of drake just arrived on CRAN, and it is faster and easier to use than previous releases.
install.packages("drake")
Data analysis can be slow. A round of scientific computation can take several minutes, hours, or even days to complete. After it finishes, if you update your code or data, your hard-earned results may no longer be valid. How much of that valuable output can you keep, and how much do you need to update? How much runtime must you endure all over again?
For projects in R, the workflow package drake can help. It analyzes your project, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.
Other pipeline tools such as Make, Snakemake, Airflow, Luigi, and Dask accomplish similar goals, but drake is uniquely suited to R. Among the new features in version 7.0.0 is a domain-specific language (DSL) for generating large plans. The DSL is still experimental, but it is already more powerful than older interface functions such as evaluate_plan(). A new section of the manual walks through the details, and the revised gross state product example is a motivating use case.
Three transformations form the core of the new syntax.
| Transformation | Tidyverse equivalent |
|---|---|
| map() | pmap() from purrr |
| cross() | crossing() from tidyr |
| combine() | summarize() from dplyr |
These transformations help declare batches of multiple targets.
library(drake)
# short names to tell get_data() what to download:
source <- c("census", "gapminder")
plan <- drake_plan(
  data = target(
    get_data(from), # custom function
    transform = map(from = !!source, .id = FALSE)
  ),
  run = target(
    analysis_function(data),
    transform = cross(data, analysis_function = c(lda, rf))
  ),
  out = target(
    my_summaries(run),
    transform = combine(run, .by = analysis_function)
  )
)
plan
#> # A tibble: 8 x 2
#>   target         command
#>   <chr>          <expr>
#> 1 data           get_data("census")
#> 2 data_2         get_data("gapminder")
#> 3 run_lda_data   lda(data)
#> 4 run_rf_data    rf(data)
#> 5 run_lda_data_2 lda(data_2)
#> 6 run_rf_data_2  rf(data_2)
#> 7 out_lda        my_summaries(run_lda_data, run_lda_data_2)
#> 8 out_rf         my_summaries(run_rf_data, run_rf_data_2)
Above, note the <expr> label underneath the command header. For the sake of faster and cleaner metaprogramming, plan$command is now a list of expressions by default. (However, you can still supply a character vector if you wish.)
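As a minimal sketch of the back-compatible character format, you can build the plan by hand instead of calling drake_plan(). The target names below are invented for illustration.

library(drake)
library(tibble)
# Commands supplied as strings; drake parses them at runtime.
plan_chr <- tibble(
  target  = c("small_data", "small_mean"),
  command = c("sample.int(100)", "mean(small_data)")
)
make(plan_chr)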
After serious profiling and benchmarking, make() and drake_config() are substantially faster in workflows with thousands of targets. The changes affect drake's internal storage protocol, so targets built with previous versions are no longer up to date, but make() pauses to let you downgrade to an earlier version or start from scratch.
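If you want to survey the damage before committing to a full rebuild, something like the following sketch can help. (In version 7.0.0, outdated() accepts the configuration object returned by drake_config().)

config <- drake_config(plan) # plan from the earlier example
outdated(config)             # names of the targets that need to rebuild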
In addition, to increase clarity and ease of use, the parallelism argument of make() has fewer choices. Of the previous 11 options, only the best 3 remain; a short example follows the table below, and the updated high-performance computing guide has the details.
| parallelism | Functionality |
|---|---|
| "loop" | No parallel computing |
| "clustermq" | Persistent workers |
| "future" | Transient workers |
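As a quick sketch, here is one way to run persistent workers on a local machine. It assumes the clustermq package is installed, and "multicore" is just one of its scheduler settings.

options(clustermq.scheduler = "multicore")
make(plan, parallelism = "clustermq", jobs = 2) # 2 persistent workers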
By default, drake watches your session and environment for dependencies. This behavior frees drake to focus fully on R, enhancing interactivity, flexibility, and independence from cumbersome configuration files. However, fully reproducible use of drake requires care, and version 7.0.0 comes with new workarounds and safeguards.
A serious drake workflow should be consistent and reliable, ideally with the help of a master R script. Before it builds your targets, this script should begin in a fresh R session and load your packages and functions in a dependable manner. Batch mode helps ensure all this goes according to plan. If you use a single persistent interactive R session to repeatedly invoke make() while you develop the workflow, then over time your session could grow stale and accidentally invalidate targets.
To combine interactivity with reproducibility, version 7.0.0 has a new experimental callr-like interface. Functions such as r_make(), r_outdated(), and r_drake_graph_info() each run drake in a transient callr session so that accidental changes to your interactive session do not break your results.
For more information, please see the newly refreshed chapter on drake projects. For example code, you can download the updated main example (drake_example("main")) and experiment with the files _drake.R and interactive.R.
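To give a flavor, here is a sketch of what a _drake.R script might contain. The file path and plan contents are illustrative, not part of the real example; the one firm convention is that the script ends with a call to drake_config().

# _drake.R: the script that r_make() runs in a fresh callr session.
library(drake)
source("R/functions.R") # hypothetical script defining get_data()
plan <- drake_plan(
  data = get_data("census"),
  result = summary(data)
)
drake_config(plan) # r_make() uses the config returned here

Then r_make(), called from your interactive session, runs this script in a clean background process and builds the targets there.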
An additional safeguard prevents workflows from invalidating themselves.
plan <- drake_plan(
  x = {
    data(mtcars)
    mtcars$mpg
  },
  y = mean(x)
)
plan
#> # A tibble: 2 x 2
#>   target command
#>   <chr>  <expr>
#> 1 x      { data(mtcars) mtcars$mpg }
#> 2 y      mean(x)
make(plan)
#> target x
#> fail x
#> Error: Target `x` failed. Call `diagnose(x)` for details. Error message:
#> cannot add bindings to a locked environment.
#> Please read the "Self-invalidation" section of the make() help file.
The error above comes from the call to data(mtcars). Without guardrails, the very act of building x changes x's dependencies. In other words, x is still invalid after make() completes.
make(plan, lock_envir = FALSE)
#> target x
#> target y
make(plan, lock_envir = FALSE)
#> target x
There are legitimate use cases for lock_envir = FALSE, but most projects should stick with the default lock_envir = TRUE.
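Often the cleanest fix is to avoid modifying the environment in the first place. In this sketch (one possible workaround, not the only one), the target reads the dataset directly, so building x no longer changes x's own dependencies.

plan <- drake_plan(
  x = datasets::mtcars$mpg, # no call to data(), no new bindings
  y = mean(x)
)
make(plan)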
drake enhances reproducibility, but not in all respects. Literate programming, local library managers, containerization, and strict session managers offer more robust solutions in their respective domains. Reproducibility encompasses a wide variety of tools and techniques working together.
drake thrives on active participation from the open source community. Most of the inspiration for version 7.0.0 arose from new GitHub issues and conversations at the 2018 and 2019 RStudio Conferences. Many thanks to everyone who offered insight, guidance, and code patches.
This announcement is a product of my own opinions and does not necessarily represent the official views of my employer.