rOpenSci | Announcing pdftools 1.0

Announcing pdftools 1.0

This week we released version 1.0 of the ropensci pdftools package to CRAN. Pdftools provides utilities for extracting text, fonts, attachments and other data from PDF files. It also supports rendering of PDF files into bitmap images.

This release has a few internal enhancements and fixes an annoying bug for landscape PDF pages. The version bump to 1.0 signifies that the package has undergone sufficient testing and the API is stable.

Extracting Text

As described in our previous post, the most common use of pdftools is extracting text from (scientific) articles for searching / indexing. But let’s try a somewhat more unusual PDF file this time: a poster.

library(pdftools)
url <- "https://www.rstudio.com/wp-content/uploads/2016/02/advancedR.pdf"

# Display author, editor
pdf_info(url)

The pdf_info file returns all kind of metadata from the pdf file. For example we can read that this PDF was created on 2016-02-12 by Arianne Colton using Acrobat PDFMaker 11 for PowerPoint.

# extract text vector
text <- pdf_text(url)

# Print text from page 1
cat(text[1])

The pdf_text function extracts text into an R character vector if length equal to the number of pages in the PDF. Note how the text is spaced to match the position in the PDF page.

Rendering PDF

Recent versions of pdftools allow rendering of PDF pages into bitmap images. The pdf_render_page function returns the bitmap as a raw vector array of size channels * width * height (in pixels).

library(pdftools)
bitmap <- pdf_render_page(url, page = 1, dpi = 72)
dim(bitmap)
### 4 1100  850

From here we can use for example the rOpenSci magick package to read the bitmap and manipulate/export it to various formats.

library(magick)
poster <- image_read(bitmap)
print(poster)
image_write(poster, "out.png", format = "png")

Or have some fun with the other magick tools :)

# Download dancing banana
banana <- image_read("https://jeroen.github.io/images/banana.gif")
banana <- image_scale(banana, "300")

# Combine and flatten frames
frames <- lapply(banana, function(frame) {
  image_composite(poster, frame, offset = "+70+30")
})

# Turn frames into animation
animation <- image_animate(image_join(frames))
print(animation)

# Save as gif
image_write(animation, "output.gif")