April 11, 2017
*Version 2.0 of my data set validation package `assertr` hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short, new introduction. For those who are already using `assertr`, the text below will point out the improvements.*
I can go on and on (and have) about the treachery of messy/bad datasets. Though it’s substantially less exciting than… pretty much everything else, I believe (proportional to the heartache and stress it causes) we don’t spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, its success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail: either flagrantly (best case) or unnoticeably (considerably more probable and considerably more pernicious).
`assertr` is an R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alerts you to any deviation from those assumptions.
I’ll return to this point later in the post when we have more background, but I want to be up front about the goals of the package: `assertr` is not (and can never be) a “one-stop shop” for all of your data validation needs. The specific kinds of checks individuals or teams have to perform on any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although the `assertive` meta-package may come very close!). But all of these checks reuse motifs and follow the same patterns. So, instead, I’m trying to sell `assertr` as a way of thinking about dataset validations: a set of common dataset validation actions. If we think of these actions as verbs, you could say that `assertr` attempts to impose a grammar of error checking for datasets.
In my experience, the overwhelming majority of data validation tasks fall into only five different patterns (each verb’s basic call shape is sketched just after this list):

- For every element in a column, check that it meets criteria that can be evaluated on each element in isolation. `assertr` calls this verb `assert`.
- For every element in a column, check criteria that can only be decided after looking at the column as a whole (for example, that every value is within n standard deviations of the column’s mean). `assertr` calls this verb `insist`.
- For every row of a dataset, check that certain assumptions hold (for example, that no row contains more than n missing values). `assertr` calls this verb `assert_rows`.
- For every row of a dataset, check criteria that can only be decided after looking at all of the rows. This closely mirrors the distinction between `assert` and `insist`, but for entire rows (not individual elements). An example of using this would be checking to make sure that the Mahalanobis distance between each row and all other rows is within n standard deviations of the mean distance. `assertr` calls this verb `insist_rows`.
- Check some property of the dataset as a whole (for example, that it has more than n rows or that it contains certain column names). `assertr` calls this last verb `verify`.
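To make the grammar concrete, here is a minimal sketch of what a bare (un-piped) call to each verb might look like; the particular predicates and column selections are only illustrative, and `dplyr` is loaded here just for the `everything()` column-selection helper.

```r
library(assertr)
library(dplyr)   # for the everything() column-selection helper

# element-wise check with a plain predicate
assert(mtcars, in_set(0, 1), am, vs)

# element-wise check whose criteria depend on the whole column
insist(mtcars, within_n_sds(4), mpg)

# row-wise check: reduce each row to one number, then test that number
assert_rows(mtcars, num_row_NAs, within_bounds(0, 2), everything())

# row-wise check whose criteria depend on all of the row reductions
insist_rows(mtcars, maha_dist, within_n_mads(10), everything())

# whole-dataset check
verify(mtcars, mpg > 0)
```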
Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README…
Pretend that, before finding the average miles per gallon for each number of engine cylinders in the `mtcars` dataset, we wanted to confirm the following dataset assumptions…

- that it has the columns `mpg`, `vs`, and `am`
- that the dataset contains more than 10 observations
- that every value in `mpg` is a positive number
- that `mpg` contains no value more than 4 standard deviations away from its mean
- that the `am` and `vs` columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
- that each row contains at most 2 `NA`s
- that each row is jointly unique across the `mpg`, `am`, and `wt` columns
- that each row's Mahalanobis distance within the dataset is within 10 median absolute deviations of all the distances (a multivariate outlier check)

```r
library(assertr)
library(magrittr)
library(dplyr)
mtcars %>%
verify(has_all_names("mpg", "vs", "am", "wt")) %>%
verify(nrow(.) > 10) %>%
verify(mpg > 0) %>%
insist(within_n_sds(4), mpg) %>%
assert(in_set(0,1), am, vs) %>%
assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
insist_rows(maha_dist, within_n_mads(10), everything()) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
```
Before `assertr` version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the `chain_start` and `chain_end` functions; all assumptions within a chain (below a call to `chain_start` and above `chain_end`) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken, but the default is to halt execution and display a comprehensive report of what failed, including line numbers and the offending datum, where applicable.
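As a rough sketch of what a chain looks like (the particular checks here are only illustrative, chosen so that both fail):

```r
library(assertr)
library(magrittr)

mtcars %>%
  chain_start() %>%
  verify(nrow(.) > 100) %>%               # fails: mtcars has only 32 rows
  assert(within_bounds(0, 20), mpg) %>%   # also fails for several rows
  chain_end()                             # both failures reported together
```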
Another major improvement since the last version of `assertr` on CRAN is that `assertr` errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell `assertr` to just return TRUE or FALSE instead of, respectively, returning the data passed in or halting execution. Alternatively, you can instruct `assertr` to just give a warning instead of throwing a fatal error. For more information on this, see `help("success_and_error_functions")`.
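For instance, something along these lines (a sketch using the `success_logical`, `error_logical`, and `error_warn` functions shipped with the package):

```r
library(assertr)
library(magrittr)

# return TRUE/FALSE instead of returning the data or halting execution
mtcars %>%
  verify(mpg > 0,
         success_fun = success_logical,   # this check passes, so TRUE is returned
         error_fun = error_logical)       # a failure would return FALSE instead

# downgrade a failed assertion to a warning: the check below fails for
# some rows, but only a warning is raised and the data is returned
mtcars %>%
  assert(within_bounds(0, 20), mpg,
         error_fun = error_warn)
```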
Since the package was initially published on CRAN (almost exactly two years ago), many people have asked me how they should go about using `assertr` to test a particular assumption (and I’m very happy to help if you have one of your own, dear reader!). In every single one of these cases, I’ve been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There are, however, two good pieces of news.
The first is that there is another package, `assertive`, that greatly enhances the `assertr` experience. The predicates (functions that start with `is_`) from this (meta)package can be used in `assertr` pipelines just as easily as `assertr`’s own predicates. And `assertive` has an enormous number of them! Some specialized and exciting examples include `is_hex_color`, `is_ip_address`, and `is_isbn_code`!
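For example, something like the following sketch should work; the data frame `palette_df` and its `hex` column are made up for illustration and are not part of either package:

```r
library(assertr)
library(assertive)   # supplies is_hex_color(), among many other predicates
library(magrittr)

palette_df <- data.frame(hex = c("#FF0000", "#00FF00", "not-a-color"),
                         stringsAsFactors = FALSE)

palette_df %>%
  assert(is_hex_color, hex)   # the malformed third value trips the assertion
```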
The second is that if `assertive` doesn’t have what you’re looking for, with just a little bit of studying the `assertr` grammar, you can whip up your own predicates with relative ease. Using some of these basic constructs and a little effort, I’m confident that the grammar is expressive enough to completely adapt to your needs.
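As a rough illustration, a home-made predicate for `assert` is just a function that returns TRUE or FALSE for the values it is given; the predicate name and the columns below are purely illustrative:

```r
library(assertr)
library(magrittr)

# TRUE only for strictly positive whole numbers
is_positive_whole_number <- function(x) x > 0 & x %% 1 == 0

mtcars %>%
  assert(is_positive_whole_number, cyl, gear, carb)
```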
If this package interests you, I urge you to read the most recent package vignette here. If you’re an `assertr` old-timer, I point you to this NEWS file that lists the changes from the previous version.