rOpenSci | A package for tidying nested lists

A package for tidying nested lists

Data == knowledge! Much of the data we use, whether it be from government repositories, social media, GitHub, or e-commerce sites comes from public-facing APIs. The quantity of data available is truly staggering, but munging JSON output into a format that is easily analyzable in R is an equally staggering undertaking. When JSON is turned into an R object, it usually becomes a deeply nested list riddled with missing values that is difficult to untangle into a tidy format. Moreover, every API presents its own challenges; code you’ve written to clean up data from GitHub isn’t necessarily going to work on Twitter data, as each API spews data out in its own unique, headache-inducing nested list structure. To ease and generalize this process, Amanda Dobbyn proposed an unconf18 project for a general API response tidier! Welcome roomba, our first stab at easing the process of tidying nested lists!

Drawing

roomba will eventually be able to walk nested lists in a variety of different structures from JSON output, replace NULL or .empty values with NAs or a user-specified value, and return a tibble with names matching a user-specified list. Of course, in two days we haven’t fully achieved this vision, but we’re off to a promising start.

The birth of roomba

It was clear Amanda was on to something good by the lively discussion in the #runconf18 issues repository leading up to the unconf. Thanks to input from Jenny Bryan, Jim Hester, Carl Boettinger, Scott Chamberlain, Bob Rudis, and Noam Ross, we had a lot of ideas to work with when the unconf began. Fortunately, Jim already had a function called dfs_idx() (here) written to perform depth-first searches of nested lists from the GitNub GraphQL API. With the core list-traversal code out of the way, we split our efforts between developing a usable interface, stockpiling .JSON files to test on, and developing a Shiny app.

What’s working

We’ve got the basic structure of roomba sorted out, and you should install it from GitHub to try out! Here are a few of the examples we’ve put together.

library(roomba)
#load twitter data example
data(twitter_data)

#roomba-fy!
roomba(twitter_data, c("created_at", "name"))

## # A tibble: 24 x 2
##    name                 created_at                    
##    <chr>                <chr>                         
##  1 Code for America     Mon Aug 10 18:59:29 +0000 2009
##  2 Ben Lorica <U+7F57><U+745E><U+5361>    Mon Dec 22 22:06:18 +0000 2008
##  3 Dan Sholler          Thu Apr 03 20:09:24 +0000 2014
##  4 Code for America     Mon Aug 10 18:59:29 +0000 2009
##  5 FiveThirtyEight      Tue Jan 21 21:39:32 +0000 2014
##  6 Digital Impact       Wed Oct 07 21:10:53 +0000 2009
##  7 Drew Williams        Thu Aug 07 18:41:29 +0000 2014
##  8 joe                  Fri May 29 13:25:25 +0000 2009
##  9 Data Analysts 4 Good Wed May 07 16:55:33 +0000 2014
## 10 Ryan Frederick       Sun Mar 01 19:06:53 +0000 2009
## # ... with 14 more rows

And just the first element of the twitter_data list will show you that roomba has simplified this process quite a bit.

twitter_data[[1]]

## $created_at
## [1] "Mon May 21 17:58:09 +0000 2018"
## 
## $id
## [1] 9.98624e+17
## 
## $id_str
## [1] "998623997397876743"
## 
## $text
## [1] "Could a program like food stamps have a Cambridge Analytica moment? How do we allow for the innovation that data pl… https://t.co/7tVf1qmNmq"
## 
## $truncated
## [1] TRUE
## 
## $entities
## $entities$hashtags
## list()
## 
## $entities$symbols
## list()
## 
## $entities$user_mentions
## list()
## 
## $entities$urls
## $entities$urls[[1]]
## $entities$urls[[1]]$url
## [1] "https://t.co/7tVf1qmNmq"
## 
## $entities$urls[[1]]$expanded_url
## [1] "https://twitter.com/i/web/status/998623997397876743"
## 
## $entities$urls[[1]]$display_url
## [1] "twitter.com/i/web/status/9…"
## 
## $entities$urls[[1]]$indices
## $entities$urls[[1]]$indices[[1]]
## [1] 117
## 
## $entities$urls[[1]]$indices[[2]]
## [1] 140
## 
## 
## 
## 
## 
## $source
## [1] "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>"
## 
## $in_reply_to_status_id
## NULL
## 
## $in_reply_to_status_id_str
## NULL
## 
## $in_reply_to_user_id
## NULL
## 
## $in_reply_to_user_id_str
## NULL
## 
## $in_reply_to_screen_name
## NULL
## 
## $user
## $user$id
## [1] 64482503
## 
## $user$id_str
## [1] "64482503"
## 
## $user$name
## [1] "Code for America"
## 
## $user$screen_name
## [1] "codeforamerica"
## 
## $user$location
## [1] "San Francisco, California"
## 
## $user$description
## [1] "Government can work for the people, by the people, in the 21st century. Help us make it so."
## 
## $user$url
## [1] "https://t.co/l9lokka0rJ"
## 
## $user$entities
## $user$entities$url
## $user$entities$url$urls
## $user$entities$url$urls[[1]]
## $user$entities$url$urls[[1]]$url
## [1] "https://t.co/l9lokka0rJ"
## 
## $user$entities$url$urls[[1]]$expanded_url
## [1] "http://codeforamerica.org"
## 
## $user$entities$url$urls[[1]]$display_url
## [1] "codeforamerica.org"
## 
## $user$entities$url$urls[[1]]$indices
## $user$entities$url$urls[[1]]$indices[[1]]
## [1] 0
## 
## $user$entities$url$urls[[1]]$indices[[2]]
## [1] 23
## 
## 
## 
## 
## 
## $user$entities$description
## $user$entities$description$urls
## list()
## 
## 
## 
## $user$protected
## [1] FALSE
## 
## $user$followers_count
## [1] 49202
## 
## $user$friends_count
## [1] 1716
## 
## $user$listed_count
## [1] 2659
## 
## $user$created_at
## [1] "Mon Aug 10 18:59:29 +0000 2009"
## 
## $user$favourites_count
## [1] 4490
## 
## $user$utc_offset
## [1] -25200
## 
## $user$time_zone
## [1] "Pacific Time (US & Canada)"
## 
## $user$geo_enabled
## [1] TRUE
## 
## $user$verified
## [1] TRUE
## 
## $user$statuses_count
## [1] 15912
## 
## $user$lang
## [1] "en"
## 
## $user$contributors_enabled
## [1] FALSE
## 
## $user$is_translator
## [1] FALSE
## 
## $user$is_translation_enabled
## [1] FALSE
## 
## $user$profile_background_color
## [1] "EBEBEB"
## 
## $user$profile_background_image_url
## [1] "http://abs.twimg.com/images/themes/theme7/bg.gif"
## 
## $user$profile_background_image_url_https
## [1] "https://abs.twimg.com/images/themes/theme7/bg.gif"
## 
## $user$profile_background_tile
## [1] FALSE
## 
## $user$profile_image_url
## [1] "http://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
## 
## $user$profile_image_url_https
## [1] "https://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
## 
## $user$profile_banner_url
## [1] "https://pbs.twimg.com/profile_banners/64482503/1497895952"
## 
## $user$profile_link_color
## [1] "CF1B41"
## 
## $user$profile_sidebar_border_color
## [1] "FFFFFF"
## 
## $user$profile_sidebar_fill_color
## [1] "F3F3F3"
## 
## $user$profile_text_color
## [1] "333333"
## 
## $user$profile_use_background_image
## [1] FALSE
## 
## $user$has_extended_profile
## [1] FALSE
## 
## $user$default_profile
## [1] FALSE
## 
## $user$default_profile_image
## [1] FALSE
## 
## $user$following
## [1] TRUE
## 
## $user$follow_request_sent
## [1] FALSE
## 
## $user$notifications
## [1] FALSE
## 
## $user$translator_type
## [1] "none"
## 
## 
## $geo
## NULL
## 
## $coordinates
## NULL
## 
## $place
## NULL
## 
## $contributors
## NULL
## 
## $is_quote_status
## [1] FALSE
## 
## $retweet_count
## [1] 0
## 
## $favorite_count
## [1] 0
## 
## $favorited
## [1] FALSE
## 
## $retweeted
## [1] FALSE
## 
## $possibly_sensitive
## [1] FALSE
## 
## $possibly_sensitive_appealable
## [1] FALSE
## 
## $lang
## [1] "en"

We created a Shiny app too, which in its current state allows you to select a .Rda or .JSON file, pick two variables, and create a scatterplot of them.

Run the app like this:

shiny_roomba()

What’s not

Of course, in two days we weren’t able to build a magical one-size-fits-all solution to every API response data headache. Right now, the main barrier to usability is that both the roomba() function and shiny_roomba() app only work on sub-list items of the same length and same data type stored at the same depth. To illustrate on the twitter_data:

#This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))

## # A tibble: 1,007 x 1
##    user      
##    <list>    
##  1 <int [1]> 
##  2 <chr [1]> 
##  3 <chr [1]> 
##  4 <chr [1]> 
##  5 <chr [1]> 
##  6 <chr [1]> 
##  7 <chr [1]> 
##  8 <list [2]>
##  9 <lgl [1]> 
## 10 <int [1]> 
## # ... with 997 more rows

#This doesn't work because "name" and "retweet_count" are at different depths.
roomba(twitter_data, c("name","retweet_count"))

## # A tibble: 0 x 0

In addition, we’ve got some features we want to add, such as handling a larger variety of column names (i.e. passing a string for a single column name, keeping all values even if they are all NULL). We would love your feedback on other things we can add (open an issue in our Git repository)!

The team

Amanda Dobbyn

Job: Data Scientist at Earlybird Software

Project contributions: initial GH issue, package name, wrapper for dfs_idx()

Jim Hester

Job: Software Engineer at RStudio

Project contributions: dfs_idx() and remove_nulls() functions, package building, README, and debugging

Christine Stawitz

Job: Postdoctoral researcher at University of Washington’s School of Aquatic and Fishery Sciences

Project contributions: Shiny app, README and blog post writing

Laura DeCicco

Job: Data Scientist at U.S. Geological Survey

Project contributions: Fixing merge conflicts :)

Isabella Velasquez

Job: Data Analyst at the Bill & Melinda Gates Foundation

Project contributions: hex sticker!