Wednesday, 16 July 2014

Word Clouds with Processing and R

I've previously relied on other tools to produce 'word clouds'. These clouds give greater prominence to words that appear more frequently in a source text. However, it is also possible to produce similar, highly customisable word clouds in Processing and R.
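To make the idea concrete, here is a minimal base R sketch (no packages needed) of what sits underneath a word cloud: count how often each word occurs, then map the counts to text sizes. The sample sentence, the helper name `scale_sizes` and the 20 to 50 size range are my own illustrative choices, and linear scaling is only one plausible approach, not necessarily what any particular tool does.

```r
# Toy text; any character string would do
txt <- "the cat sat on the mat and the dog sat too"

# Split on whitespace and count occurrences of each word
words <- strsplit(tolower(txt), "\\s+")[[1]]
freq  <- sort(table(words), decreasing = TRUE)

# Hypothetical helper: linearly rescale counts to a size range
scale_sizes <- function(counts, min_size = 20, max_size = 50) {
  counts <- as.numeric(counts)
  rng <- range(counts)
  # If every word is equally frequent, give them all the same size
  if (rng[1] == rng[2]) return(rep(max_size, length(counts)))
  min_size + (counts - rng[1]) / (rng[2] - rng[1]) * (max_size - min_size)
}

sizes <- setNames(scale_sizes(freq), names(freq))
print(sizes)
```

The most frequent word ends up at the top of the size range and words that appear only once at the bottom, which is the visual effect a word cloud relies on.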

Processing relies on the excellent WordCram library. This is straightforward to install and can take any text file or website and produce a word cloud. The source code below demonstrates the variety of presentation options that are available. While the WordCram extension allows for some nice visual tricks, the text file has to be free of any unwanted content, such as numbers or words that need to be excluded, before creating a word cloud.

In other words, Processing doesn't allow for any advanced text mining techniques: for example, you might want to remove all numbers or specific words from a file before producing a word cloud. R becomes more useful in this instance. Combining the tm (text mining) and wordcloud packages* makes for a comprehensive set of tools, although I've found importing text files via the tm package particularly annoying. Once the data is in memory, though, the rest of the code is fairly straightforward (see below).
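The kind of clean-up described above can be sketched in base R before involving tm at all. This is only a rough illustration: the function name `clean_text`, the sample sentence and the list of words to drop are all made up for the example.

```r
# Hypothetical helper: strip numbers, punctuation and chosen words
clean_text <- function(x, drop_words = character(0)) {
  x <- tolower(x)
  x <- gsub("[[:digit:]]+", " ", x)   # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)   # remove punctuation
  words <- strsplit(trimws(x), "\\s+")[[1]]
  # Keep non-empty words that are not on the exclusion list
  words[!(words %in% drop_words) & nzchar(words)]
}

raw <- "In 2014, 12 students said the lecture was good; 3 said the lecture was bad."
cleaned <- clean_text(raw, drop_words = c("the", "was", "in", "said"))
print(table(cleaned))
```

The resulting vector of words could then be counted with `table()` or handed on to a plotting function.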

Processing Code

import processing.pdf.*;
import wordcram.*;

PFont georgia = createFont("Georgia", 1);
PFont Courier = createFont("Courier New", 1);
PFont georgiaItalic = createFont("Georgia Italic", 1);
PFont minyaNouvelle = createFont("../MINYN___.TTF", 1);

size(800, 450);
//size(800, 450, PDF, "int.pdf"); // alternative: render to a PDF file instead of the screen


/* You can control what font(s) your words are drawn in,
 * what color(s) they're drawn in, and what angle(s)
 * they're drawn at.
 * - Use any Processing colors
 * - Angles are in radians
 * - Use any Processing PFonts, or use a font name
 * WordCram has some convenient methods for some common
 * uses.
 */

new WordCram(this)
  // Coloring words
  .withColors(#FF0000, #00CC00, #0000FF)
  //.withColor(color(0, 20, 150, 150)) // alpha works, too
  // #785c75, #7D956C, #8A9467, #816F6C, #9D855E, #939464,#9B9478
  // But this won't work the way you expect it to:
  //.withColors(255, 0, 0) // Not red - invisible!
  // See the FAQ for the details, if you're curious:
  // (WordCram FAQ at GoogleCode)
  // Words at Angles
  //.angledAt(radians(30), radians(-60))
  // Two-thirds of the words will be at 30 degrees, the rest at -60.
  //.angledAt(radians(30), radians(30), radians(-60))
  .angledBetween(-PI/8, PI/8)
  //.angledBetween(0, TWO_PI)
  // Much bigger: not all words will fit, and it'll take 
  // longer to place them. Be patient!
  .sizedByWeight(20, 50)
  // Fonts
  //.withFonts("Georgia", "Minya Nouvelle")
  //.withFonts(georgia, minyaNouvelle)
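  // Padding
  //.withWordPadding(1)
  // Text source and final draw call ("yourtext.txt" is a placeholder file name)
  .fromTextFile("yourtext.txt")
  .drawAll();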

R Code

## Word clouds

library(tm)         # text mining tools
library(wordcloud)  # word cloud plotting

# read every text file in the "textf2f" folder into a corpus for analysis with tm
f2f <- Corpus(DirSource("textf2f"), readerControl = list(language = "en"))

#remove whitespace
f2f <- tm_map(f2f, stripWhitespace)

# remove punctuation
f2f<- tm_map(f2f, removePunctuation)

# remove numbers
f2f <- tm_map(f2f, removeNumbers)

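# convert to lower case so capitalised stopwords (e.g. "The") are matched
f2f <- tm_map(f2f, content_transformer(tolower))

# remove stopwords
f2f <- tm_map(f2f, removeWords, stopwords('english'))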

wordcloud(f2f, scale = c(8, .3), min.freq = 2, max.words = 100, random.order = TRUE,
          rot.per = .15, colors = "red", vfont = c("sans serif", "plain"))

*Packages can be installed in R with install.packages(), e.g. install.packages(c("tm", "wordcloud")).
