April 26, 2018
Our onboarding reviews, which ensure that packages contributed by the community undergo a transparent, constructive, non-adversarial and open review process, take place in the issue tracker of a GitHub repository. Development of the packages we onboard also takes place in the open, most often in GitHub repositories.
Therefore, when I wanted data about our onboarding system to give a data-driven overview of it, my mission was to extract data from GitHub and git repositories, and to put it into nice rectangles (as defined by Jenny Bryan) ready for analysis. You might call that the first step of a “tidy git analysis”, using the term coined by Simon Jackson. So, how did I collect data?
In the following, I’ll mention repositories. All of them are git repositories, which means they’re folders under version control where, roughly speaking, all changes are saved via commits whose messages (more or less) describe what’s been changed. On top of that, these repositories live on GitHub, which means they get to enjoy some infrastructure such as issue trackers, milestones, starring by admirers, etc. If that ecosystem is brand new to you, I recommend reading this book, especially its big picture chapter.
Each package submission is an issue thread in our onboarding repository, see an example here. The first comment in that issue is the submission itself, followed by many comments by the editor, reviewers and authors. On top of all the data that’s saved there, mostly text data, we have a private Airtable workspace where we have a table of reviewers and their reviews, with direct links to the issue comments that are reviews.
Unsurprisingly, the first step here was to “get issue threads”. What do I mean? I wanted a table of all issue threads, one row per comment, with columns indicating the time at which something was written, and columns digesting the data from the issue itself, e.g. guessing the role of the commenter from other information: the first user of the issue is the “author”.
I used to use GitHub API V3 and then heard about GitHub API V4, which blew my mind. As if I weren’t impressed enough by the mere existence of this API and its advantages, I discovered that the rOpenSci ghql package allows one to interact with such an API and that its docs actually use GitHub API V4 as an example! Carl Boettiger told me about his way to rectangle JSON data, using jq, a language for processing JSON, via a dedicated rOpenSci package, jqr.
I have nothing against GitHub API V3 and gh and purrr workflows, but I was curious and really enjoyed learning these new tools and writing this code. I had written gh/purrr code for getting the same information and it felt clumsier, but that might just be because I wasn’t perfectionist enough when writing it! I managed to write the correct GitHub V4 API query to get just what I needed by using its online explorer. I then succeeded in transforming the JSON output into a rectangle by reading Carl’s post, but also by taking advantage of another online explorer, jq play, where I pasted my output via writeClipboard. That’s nearly always the way I learn about query tools: using some sort of explorer and then pasting the code into a script. When I am more experienced, I can skip the explorer part.
The first function I wrote was one for getting the issue number of the last onboarding issue, because then I looped/mapped over all issues.
library("ghql")
library("httr")
library("magrittr")
# function to get number of last issue
get_last_issue <- function(){
  query <- '{
  repository(owner: "ropensci", name: "onboarding") {
    issues(last: 1) {
      edges{
        node{
          number
        }
      }
    }
  }
}'
  token <- Sys.getenv("GITHUB_GRAPHQL_TOKEN")
  cli <- GraphqlClient$new(
    url = "https://api.github.com/graphql",
    headers = add_headers(Authorization = paste0("Bearer ", token))
  )
  ## define query
  ### create a query class first
  qry <- Query$new()
  qry$query('issues', query)
  last_issue <- cli$exec(qry$queries$issues)
  last_issue %>%
    jqr::jq('.data.repository.issues.edges[].node.number') %>%
    as.numeric()
}
get_last_issue()
## [1] 201
Then I wrote a function for getting all the precious info I needed from an issue thread. At the time it lived on its own in an R script; it’s now included in my ghrecipes package as get_issue_thread, so you can check out the code there, along with other useful recipes for analyzing GitHub data.
Then I launched this code to get all data! It was very satisfying.
# get all threads
issues <- purrr::map_df(1:get_last_issue(), get_issue_thread)
# for the one(s) with 101 comments get the 100 last comments
long_issues <- issues %>%
  dplyr::count(issue) %>%
  dplyr::filter(n == 101) %>%
  dplyr::pull(issue)
issues2 <- purrr::map_df(long_issues, get_issue_thread, first = FALSE)
all_issues <- dplyr::bind_rows(issues, issues2)
all_issues <- unique(all_issues)
readr::write_csv(all_issues, "data/all_threads_v4.csv")
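The overlap handling above can be illustrated on toy data. The sketch below uses base R only and a made-up thread of 150 comments (the `comment_id` column is hypothetical, not necessarily a column returned by `get_issue_thread`): one request returns comments 1 to 100, the other returns the last 100, and `unique()` removes the 50 rows fetched twice.

```r
# Toy thread with 150 comments; comment_id is a hypothetical identifier
first_page <- data.frame(issue = 42, comment_id = 1:100)
last_page  <- data.frame(issue = 42, comment_id = 51:150)

# Stack both pages, then drop the rows that appear in both
all_comments <- unique(rbind(first_page, last_page))

nrow(all_comments)
```

Deduplicating whole rows only works if the repeated rows are strictly identical in every column, which is why the same function is used for both requests.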
In the previous step we got a rectangle of all threads, with information from the first issue comment (such as labels) distributed to all the comments of the threads.
issues <- readr::read_csv("data/all_threads_v4.csv")
issues <- janitor::clean_names(issues)
issues <- dplyr::rename(issues, user = author)
issues <- dplyr::select(issues, - dplyr::contains("topic"))
issues %>%
  head() %>%
  dplyr::select(- body) %>%
  knitr::kable()
title | author_association | assignee | created_at | closed_at | user | comment_url | package | pulled | issue | meta | x6_approved | out_of_scope | x4_review_s_in_awaiting_changes | x0_presubmission | question | x3_reviewer_s_assigned | holding | legacy | x1_editor_checks | x5_awaiting_reviewer_s_response | x2_seeking_reviewer_s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rrlite | OWNER | sckott | 2015-03-10 23:22:45 | 2015-03-31 00:16:28 | richfitz | NA | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-10 23:26:11 | 2015-03-31 00:16:28 | richfitz | https://github.com/ropensci/software-review/issues/1#issuecomment-78170639 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-11 19:29:32 | 2015-03-31 00:16:28 | karthik | https://github.com/ropensci/software-review/issues/1#issuecomment-78351979 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-11 21:08:59 | 2015-03-31 00:16:28 | sckott | https://github.com/ropensci/software-review/issues/1#issuecomment-78372187 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-11 21:13:11 | 2015-03-31 00:16:28 | karthik | https://github.com/ropensci/software-review/issues/1#issuecomment-78373054 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
rrlite | OWNER | sckott | 2015-03-11 21:33:45 | 2015-03-31 00:16:28 | richfitz | https://github.com/ropensci/software-review/issues/1#issuecomment-78377124 | TRUE | TRUE | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Now we need a few more steps:
transforming NA into FALSE for variables corresponding to labels,
getting the package name from Airtable since the titles of issues are not uniformly formatted,
knowing which comment is a review,
deducing the role of the user writing the comment (author/editor/reviewer/community manager/other).
Below, the variables corresponding to labels are converted to logical values and only rows corresponding to approved packages are kept.
# labels
replace_1 <- function(x){
  !is.na(x[1])
}
# binary variables
ncol_issues <- ncol(issues)
issues <- dplyr::group_by(issues, issue) %>%
  dplyr::arrange(created_at) %>%
  dplyr::mutate_at(9:(ncol_issues-1), replace_1) %>%
  dplyr::ungroup()
# keep only issues that are finished
issues <- dplyr::filter(issues, package, !x0_presubmission,
                        !out_of_scope, !legacy,
                        !x1_editor_checks, x6_approved)
issues <- dplyr::select(issues, - dplyr::starts_with("x"),
                        - package, - out_of_scope, - legacy,
                        - meta, - holding, - pulled, - question)
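To make the label step concrete, here is a minimal base R sketch on made-up data. It assumes, as in the table above, that a label column holds TRUE on every row of an issue carrying the label and NA otherwise. The column name `approved` and the helper `label_to_logical` are illustrative, and `ave()` stands in for the grouped `mutate_at()` call.

```r
# Made-up data: issue 1 carries the label, issue 2 does not
toy <- data.frame(
  issue = c(1, 1, 2, 2),
  approved = c(TRUE, TRUE, NA, NA)
)

# Within one issue, look at the first row only and repeat the answer for all rows
label_to_logical <- function(x) {
  rep(!is.na(x[1]), length(x))
}

# ave() applies the helper separately to the rows of each issue
toy$approved <- ave(toy$approved, toy$issue, FUN = label_to_logical)
toy$approved
```

After this step every label column is a plain TRUE/FALSE vector, which is what makes a filter such as `!legacy` possible.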
Then, thanks to the airtabler package we can add the name of the package, and identify review comments.
# airtable data
airtable <- airtabler::airtable("appZIB8hgtvjoV99D", "Reviews")
airtable <- airtable$Reviews$select_all()
airtable <- dplyr::mutate(airtable,
                          issue = as.numeric(stringr::str_replace(onboarding_url,
                                                                  ".*issues\\/", "")))
# we get the name of the package
# and we know which comments are reviews
reviews <- dplyr::select(airtable, review_url, issue, package) %>%
  dplyr::mutate(is_review = TRUE)
issues <- dplyr::left_join(issues, reviews, by = c("issue", "comment_url" = "review_url"))
issues <- dplyr::mutate(issues, is_review = !is.na(is_review))
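As a small illustration of the matching logic, the base R sketch below extracts the issue number from a URL with the same kind of regular expression and flags review comments. The URLs and comment identifiers are invented for the example, and `%in%` on a combined key is used here in place of `dplyr::left_join()`.

```r
# Invented Airtable-like records: one row per review
reviews <- data.frame(
  onboarding_url = "https://github.com/ropensci/onboarding/issues/6",
  review_url = "https://github.com/ropensci/onboarding/issues/6#issuecomment-2",
  stringsAsFactors = FALSE
)
# Everything up to and including "issues/" is removed, leaving the number
reviews$issue <- as.numeric(sub(".*issues/", "", reviews$onboarding_url))

# Invented comments from the same thread (the first one has no URL)
comments <- data.frame(
  issue = c(6, 6, 6),
  comment_url = c(NA,
                  "https://github.com/ropensci/onboarding/issues/6#issuecomment-1",
                  "https://github.com/ropensci/onboarding/issues/6#issuecomment-2"),
  stringsAsFactors = FALSE
)

# A comment is a review when its (issue, URL) pair appears among the reviews
review_keys <- paste(reviews$issue, reviews$review_url)
comments$is_review <- paste(comments$issue, comments$comment_url) %in% review_keys
comments$is_review
```

Comments without a URL never match a review, which mirrors turning the NA left by the join into FALSE.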
Finally, the inelegant code below attributes a role to each user (commenter is its more precise version, which differentiates reviewer 1 from reviewer 2). I could have used dplyr::case_when.
# non elegant code to guess role
issues <- dplyr::group_by(issues, issue)
issues <- dplyr::arrange(issues, created_at)
issues <- dplyr::mutate(issues, author = user[1])
issues <- dplyr::mutate(issues, package = unique(package[!is.na(package)]))
issues <- dplyr::mutate(issues, assignee = assignee[1])
issues <- dplyr::mutate(issues, reviewer1 = ifelse(!is.na(user[is_review][1]), user[is_review][1], ""))
issues <- dplyr::mutate(issues, reviewer2 = ifelse(!is.na(user[is_review][2]), user[is_review][2], ""))
issues <- dplyr::mutate(issues, reviewer3 = ifelse(!is.na(user[is_review][3]), user[is_review][3], ""))
issues <- dplyr::ungroup(issues)
issues <- dplyr::group_by(issues, issue, created_at, user)
# regexp because in at least 1 case assignee = 2 names glued together
issues <- dplyr::mutate(issues, commenter = ifelse(stringr::str_detect(assignee, user), "editor", "other"))
issues <- dplyr::mutate(issues, commenter = ifelse(user == author, "author", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer1, "reviewer1", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer2, "reviewer2", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == reviewer3, "reviewer3", commenter))
issues <- dplyr::mutate(issues, commenter = ifelse(user == "stefaniebutland", "community_manager", commenter))
issues <- dplyr::ungroup(issues)
issues <- dplyr::mutate(issues, role = commenter,
                        role = ifelse(stringr::str_detect(role, "reviewer"),
                                      "reviewer", role))
issues <- dplyr::select(issues, - author, - reviewer1, - reviewer2, - reviewer3, - assignee,
                        - author_association, - comment_url)
readr::write_csv(issues, "data/clean_data.csv")
The role “other” corresponds to anyone chiming in, while the community manager’s role is to plan blog posts with the package author. Indeed, we have a series of guest blog posts from package authors that illustrate the review process as well as their onboarded packages.
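The priority logic of the role guessing can also be written more compactly. The following is a base R sketch, not the `case_when` version. `guess_role` and its arguments are hypothetical names, the reviewer assignment in the usage example is made up, and the editor is matched by exact equality rather than by pattern. Roles are assigned from lowest to highest priority so that later assignments override earlier ones, as in the chain of `ifelse()` calls.

```r
# Hypothetical helper: guess the role of each commenter in one issue thread
guess_role <- function(user, author, editor, reviewers, community_manager) {
  role <- rep("other", length(user))
  role[user == editor] <- "editor"
  role[user == author] <- "author"
  role[user %in% reviewers] <- "reviewer"
  role[user == community_manager] <- "community_manager"
  role
}

# Made-up assignment of roles for a handful of commenters
users <- c("richfitz", "sckott", "karthik", "jeroen", "gaborcsardi", "richfitz")
roles <- guess_role(users, author = "richfitz", editor = "sckott",
                    reviewers = "jeroen", community_manager = "stefaniebutland")
roles
```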
Here is the final table. I drop the “body” column because formatting in the text could break the output here, but I do have the text corresponding to each comment.
issues %>%
  dplyr::select(- body) %>%
  head() %>%
  knitr::kable()
title | created_at | closed_at | user | issue | package | is_review | commenter | role |
---|---|---|---|---|---|---|---|---|
rrlite | 2015-03-31 00:25:14 | 2015-04-13 23:26:38 | richfitz | 6 | rrlite | FALSE | author | author |
rrlite | 2015-04-01 17:30:51 | 2015-04-13 23:26:38 | sckott | 6 | rrlite | FALSE | editor | editor |
rrlite | 2015-04-01 17:36:03 | 2015-04-13 23:26:38 | karthik | 6 | rrlite | FALSE | other | other |
rrlite | 2015-04-02 03:36:09 | 2015-04-13 23:26:38 | jeroen | 6 | rrlite | FALSE | reviewer2 | reviewer |
rrlite | 2015-04-02 03:50:43 | 2015-04-13 23:26:38 | gaborcsardi | 6 | rrlite | FALSE | other | other |
rrlite | 2015-04-02 03:53:57 | 2015-04-13 23:26:38 | richfitz | 6 | rrlite | FALSE | author | author |
There are 2521 comments, corresponding to 70 onboarded packages.
As mentioned earlier, onboarded packages are most often developed on GitHub. After onboarding they live in the ropensci GitHub organization; previously some of them were onboarded into ropenscilabs, but they should all be transferred soon. In any case, their being on GitHub means it’s possible to get their history and have a glimpse at the work represented by onboarding!
Using rOpenSci’s git2r package, I cloned all onboarded repositories in a “repos” folder. Since I didn’t know which package was in ropensci or ropenscilabs, I tried both.
airtable <- airtabler::airtable("appZIB8hgtvjoV99D", "Reviews")
airtable <- airtable$Reviews$select_all()
safe_clone <- purrr::safely(git2r::clone)
# github link either ropensci or ropenscilabs
clone_repo <- function(package_name){
  print(package_name)
  url <- paste0("https://github.com/ropensci/", package_name, ".git")
  local_path <- paste0(getwd(), "/repos/", package_name)
  clone_from_ropensci <- safe_clone(url = url, local_path = local_path,
                                    progress = FALSE)
  if(is.null(clone_from_ropensci$result)){
    url <- paste0("https://github.com/ropenscilabs/", package_name, ".git")
    clone_from_ropenscilabs <- safe_clone(url = url, local_path = local_path,
                                          progress = FALSE)
    if(is.null(clone_from_ropenscilabs$result)){
      # neither organization worked
      message("OUCH: could not clone ", package_name)
    }
  }
}
pkgs <- unique(airtable$package)
pkgs <- pkgs[!pkgs %in% fs::dir_ls()]
pkgs <- pkgs[pkgs != "rrricanes"]
purrr::walk(pkgs, clone_repo)
I didn’t clone “rrricanes” because it was too big!
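The "try one organization, then the other" pattern can be sketched without any network access. In the base R example below, `first_working_url` is a hypothetical helper and `fake_clone` is a stand-in for `git2r::clone` that fails for every URL outside ropenscilabs. The package name is made up and only the fallback logic is illustrated.

```r
# Hypothetical helper: return the first URL for which fetch() succeeds
first_working_url <- function(urls, fetch) {
  for (url in urls) {
    ok <- tryCatch({
      fetch(url)
      TRUE
    }, error = function(e) FALSE)
    if (ok) {
      return(url)
    }
  }
  # no candidate worked
  NA_character_
}

# Stand-in for a clone function: pretend the repo only exists in ropenscilabs
fake_clone <- function(url) {
  if (!grepl("ropenscilabs", url, fixed = TRUE)) {
    stop("repository not found")
  }
  invisible(url)
}

candidate_urls <- paste0("https://github.com/", c("ropensci", "ropenscilabs"),
                         "/somepackage.git")
first_working_url(candidate_urls, fake_clone)
```

Returning the URL that worked (or NA) makes it easy to record afterwards which organization each package was found in.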
I then got the commit logs of each repo for various reasons:
commits themselves show how much code and documentation editing was done during review;
I wanted to be able to git reset --hard the repo to its state at submission, for which I needed the commit logs.
I used the gitsum package to get commit logs because its dedicated high-level functions made it easier than with git2r.
library("magrittr")
get_report <- function(package_name){
  message(package_name)
  local_path <- paste0(getwd(), "/repos/", package_name)
  if(length(fs::dir_ls(local_path)) != 0){
    gitsum::init_gitsum(local_path, over_write = TRUE)
    report <- gitsum::parse_log_detailed(local_path)
    report <- dplyr::select(report, - nested)
    report$package <- package_name
    if(!"datetime" %in% names(report)){
      report <- dplyr::mutate(report,
                              hour = as.numeric(stringr::str_sub(timezone, 1, 3)),
                              minute = as.numeric(stringr::str_sub(timezone, 4, 5)),
                              datetime = date + lubridate::hours(-1 * hour) + lubridate::minutes(-1 * minute))
      report <- dplyr::select(report, - hour, - minute, - timezone)
    }
    report <- dplyr::select(report, - date)
    return(report)
  }else{
    return(NULL)
  }
}
packages <- fs::dir_ls("repos")
packages <- stringr::str_replace_all(packages, "repos\\/", "")
purrr::map_df(packages, get_report) %>%
  readr::write_csv("output/gitsum_reports.csv")
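The timezone arithmetic in `get_report` converts a local commit time to UTC by subtracting the offset. Here is a self-contained base R sketch of that conversion. `to_utc` is a hypothetical helper, it assumes offsets formatted like "+0200" or "-0530" and a local time that was parsed as if it were UTC, and unlike the code above it applies the sign to the minutes as well as to the hours.

```r
# Hypothetical helper: shift a wall-clock time by its UTC offset
to_utc <- function(local_time, offset) {
  sign <- ifelse(substr(offset, 1, 1) == "-", -1, 1)
  hours <- as.numeric(substr(offset, 2, 3))
  minutes <- as.numeric(substr(offset, 4, 5))
  # POSIXct arithmetic is in seconds
  local_time - sign * (hours * 3600 + minutes * 60)
}

# Wall-clock time stored as if it were UTC before the correction
local_time <- as.POSIXct("2018-04-26 12:00:00", tz = "UTC")
to_utc(local_time, "+0200")  # 10:00 UTC
to_utc(local_time, "-0530")  # 17:30 UTC
```

Putting all commits on the same clock matters here because they are later compared with the opening time of the onboarding issue.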
Crossing information from the issue threads and from commit logs, I could find the latest commit before submission and create a copy of each repo before resetting it to that state. This is the closest to a Time-Turner that I have!
library("magrittr")
# get issues opening datetime
issues <- readr::read_csv("data/clean_data.csv")
issues <- dplyr::group_by(issues, package)
issues <- dplyr::summarise(issues, opened = min(created_at))
# now for each package keep only commits before that
commits <- readr::read_csv("output/gitsum_reports.csv")
commits <- dplyr::left_join(commits, issues, by = "package")
commits <- dplyr::group_by(commits, package)
commits <- dplyr::filter(commits, datetime <= opened)
# and from them keep the latest one,
# that's the latest commit before submission!
commits <- dplyr::filter(commits, datetime == max(datetime), !is_merge)
commits <- dplyr::summarize(commits, hash = hash[1])
# small helper function
# (git2r commits were S4 objects when this was written;
# newer git2r releases, which moved to S3 classes, use commit$sha instead)
get_sha <- function(commit){
  commit@sha
}
set_archive <- function(package_name, commit){
  message(package_name)
  # copy the entire repo to another location
  local_path <- paste0(getwd(), "/repos/", package_name)
  local_path_archive <- paste0(getwd(), "/repos_at_submission/", package_name)
  fs::dir_copy(local_path, local_path_archive)
  # get all commits -- it's fast which is why I don't use gitsum report here
  commits <- git2r::commits(git2r::repository(local_path_archive))
  # get their sha
  sha <- purrr::map_chr(commits, get_sha)
  # extract the commit object whose sha matches
  # the latest commit before submission
  commit <- commits[sha == commit][[1]]
  # do a hard reset at that commit
  git2r::reset(commit, reset_type = "hard")
}
purrr::walk2(commits$package, commits$hash, set_archive)
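The selection of the latest commit before submission boils down to a filter followed by a maximum. Here is a minimal base R sketch on invented data. The hashes, the dates and the helper name `latest_commit_before` are made up, and merge commits are ignored for simplicity.

```r
# Invented commit log for one package
commits <- data.frame(
  hash = c("aaa111", "bbb222", "ccc333"),
  datetime = as.POSIXct(c("2015-03-01 10:00:00",
                          "2015-03-09 18:30:00",
                          "2015-03-20 09:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
# Invented submission time: when the onboarding issue was opened
opened <- as.POSIXct("2015-03-10 12:00:00", tz = "UTC")

# Hypothetical helper: hash of the most recent commit made before `opened`
latest_commit_before <- function(commits, opened) {
  before <- commits[commits$datetime <= opened, ]
  if (nrow(before) == 0) {
    return(NA_character_)
  }
  before$hash[which.max(as.numeric(before$datetime))]
}

latest_commit_before(commits, opened)
```

A package whose repository has no commit before the issue was opened yields NA here, a case worth checking before attempting a reset.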
There’s more data to be collected or prepared! From GitHub issues, using GitHub archive one could get the labelling history: when did an issue go from “editor-checks” to “seeking-reviewers” for instance? It’d help characterize the usual speed of the process. One could also try to investigate the formal and less formal links between the onboarded repository and the review: did commits and issues mention the onboarding review (with words), or even actually put a link to it? Are actors in the process more or less active on GitHub in other activities, e.g. could we see that some reviewers create or revive their GitHub account especially for reviewing?
Rather than enlarging my current dataset, I’ll present its analysis in two further blog posts answering the questions “How much work is rOpenSci onboarding?” and “How to characterize the social weather of rOpenSci onboarding?”. In case you’re too impatient, in the meantime you can dive into this blog post by Augustina Ragwitz about measuring open-source influence beyond commits and this one by rOpenSci co-founder Scott Chamberlain about exploring git commits with git2r.