May 3, 2017
On 21-22 April, the London School of Economics hosted the Text Analysis Package Developers’ Workshop, a two-day event in London that brought together developers of R packages for working with text and text-related data. The packages represented covered a wide range of applications: string handling (stringi) and tokenization (the rOpenSci-onboarded tokenizers, KoNLP), corpus and text processing (readtext, tm, quanteda, and qdap), natural language processing (NLP) such as part-of-speech and dependency tagging (cleanNLP, spacyr), and the statistical analysis of textual data (stm, text2vec, and koRpus), although this list is hardly complete. The main objective was to bring together experts working on various aspects of text processing and text analysis using R, to discuss common challenges and identify collaborative solutions.
The workshop was semi-structured, somewhere between a traditional agenda-driven workshop and an “unconference”. On Day 1 and in the first session of Day 2, we held a set of seven topically focused roundtables, each with three to four participants plus an appointed moderator. Each session lasted about an hour and was organized around a theme or problem, such as segmenting and tokenizing text, or parsing linguistic structure and parts of speech (the full list is here). This gave everyone a chance to participate in a structured event and served to guide discussion. The roundtables proved invaluable in identifying challenges, especially the final one on Saturday morning on package interoperability, which benefited greatly from the discussion across the first six roundtables. (A group dinner at a local brewpub the evening before also helped spur this discussion.)
For the remainder of Day 2, people joined self-organized groups to work on projects we had identified in the final roundtable session, and spent the rest of the day planning and programming in these groups. We concluded with brief presentations from each group on what they had learned, accomplished, and developed. The results ranged from drafting interoperability standards for textual data exchange, to starting new packages such as textcolor for displaying highlighted keywords in text, to completing fully functional packages on the spot (antiword, with Jeroen giving a masterclass in rapid package development and deployment). It was also a great chance to bug Marek to add new features to stringi (such as Unicode 9.0 support and Korean segmentation rules).
The main sponsor for the workshop was my European Research Council grant ERC-2011-StG 283794-QUANTESS, a five-year project aimed at developing methodologies and tools for the social scientific analysis of textual data. We also enjoyed support from the LSE’s Social and Economic Data Sciences Unit.
Following my eye-opening introduction to rOpenSci last year through the 2016 unconference, I was also eager to adopt the rOpenSci approach, including a partial unconference format, the embrace of the rOpenSci communication channels, the on-boarding and package peer-review process, and especially the positive and constructive spirit of collaboration. Karthik and Jeroen helped with the workshop webpage and with the Slack team. More than 10 weeks before the event, we set up a #text-sig channel there and invited all participants to join. We also solicited issues and held pre-conference discussions about them through the conference GitHub repository.
Going forward, we anticipate working together on the projects we started, such as the Text Interchange Format, contributing to common tool packages such as tokenizers, or launching new packages. We hope that the participants will also take part in the on-boarding process as reviewers or as authors. Finally, as a proof of concept for the “special interest group” idea applied to rOpenSci, we hope to serve as a model for the SIG structure, developing a set of guidelines that could apply to other topically focused groups. We also hope to continue the meetings of the text-SIG through future workshops and through continued communication using the Slack channel and the GitHub issues. This enormously positive experience deserves not only to be repeated but also to be extended to new participants.