17 - Text Mining

Answers to exercises

1.
Take the recent annual reports for UPS and convert them to text using an online service, such as http://www.fileformat.info/convert/doc/pdf2txt.htm. Complete the following tasks:
1a.
Count the words in the most recent annual report.
f <- readChar("http://dl.dropboxusercontent.com/u/53671029/UGAMIT/UPS_Annual_Reports/2012.txt", nchars=1e6
y <-  str_split(f, " ")
# report length of the vector
length(y[[1]])
1b.
Compute the readability of the most recent annual report.
require(koRpus)
#tokenize the report (newer versions of koRpus also need the koRpus.lang.en language package)
check.text <- tokenize(f, format = 'obj', lang = 'en')
#score with the Flesch-Kincaid grade level
readability(check.text, 'Flesch.Kincaid', hyphen = NULL, force.lang = 'en')
1c.
Create a corpus.
require(tm)
require(stringr)
#set up a data frame to hold the five UPS annual reports
df <- data.frame(doc_id = as.character(2008:2012), text = character(5), stringsAsFactors = FALSE)
begin <- 2008
i <- begin
#read the annual reports
while (i < 2013) {
  y <- as.character(i)
  #create the file name
  f <- str_c('http://dl.dropboxusercontent.com/u/53671029/UGAMIT/UPS_Annual_Reports/', y, '.txt')
  #read the annual report as one large string
  d <- readChar(f, nchars = 1e6)
  #add the annual report to the data frame
  df[i - begin + 1, 'text'] <- d
  i <- i + 1
}

#create the corpus (recent versions of tm expect doc_id and text columns)
reports <- Corpus(DataframeSource(df))
1d.
Preprocess the corpus.
require(SnowballC) #Snowball has been superseded by SnowballC
require(RWeka)
require(rJava)
require(RWekajars)

#convert all letters to lower case (tolower is not a tm transformation, so wrap it)
clean.reports <- tm_map(reports, content_transformer(tolower))
#remove punctuation
clean.reports <- tm_map(clean.reports, removePunctuation)
#remove all numbers
clean.reports <- tm_map(clean.reports, removeNumbers)
#strip white space
clean.reports <- tm_map(clean.reports, stripWhitespace)
#stop word filter
clean.reports <- tm_map(clean.reports, removeWords, stopwords("SMART"))
#remove common words (lower case, since the corpus has already been lowercased)
dictionary <- c("ups", "united", "parcel", "million", "billion", "dollar")
clean.reports <- tm_map(clean.reports, removeWords, dictionary)
#stem words to their roots
stem.reports <- tm_map(clean.reports, stemDocument, language = "english")
1e.
Create a term-document matrix and compute the frequency of words in the corpus.
#build the term-document matrix from the stemmed corpus, keeping words of at least three letters
tdm <- TermDocumentMatrix(stem.reports, control = list(wordLengths = c(3, Inf)))
#complete stems back to the most prevalent full word in the cleaned corpus
tdm.stem <- stemCompletion(rownames(tdm), dictionary = clean.reports, type = "prevalent")
rownames(tdm) <- as.vector(tdm.stem)
#list words occurring at least 500 times
findFreqTerms(tdm, lowfreq = 500, highfreq = Inf)
1f.
Construct a word cloud for the 25 most common words.
#convert the term-document matrix to a regular matrix to get word frequencies
m <- as.matrix(tdm)
#sort terms by frequency, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
#get the words corresponding to the frequencies
names <- names(v)
#create a data frame for plotting
d <- data.frame(word = names, freq = v)
require(wordcloud)
require(RColorBrewer)
#select the color palette
pal <- brewer.pal(5, "BuGn")
#generate the cloud from the 25 most frequent words
wordcloud(d$word, d$freq, min.freq = d$freq[25], colors = pal)
1g.
Undertake a cluster analysis, identify which reports are similar in nature, and see if you can explain why some reports are in different clusters.
require(ggplot2)
require(ggdendro)
#name the columns with the report's year
colnames(tdm) <- as.character(2008:2012)
#remove sparse terms
tdm1 <- removeSparseTerms(tdm, 0.5)
#transpose the matrix so that each report is a row
tdmtranspose <- t(tdm1)
#cluster the reports on the distance between their term frequencies
cluster <- hclust(dist(as.matrix(tdmtranspose)), method = 'centroid')
#get the clustering data
dend <- as.dendrogram(cluster)
#plot the tree
ggdendrogram(dend, rotate = TRUE)
1h.
Build a topic model for the annual reports.
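The original answer omits code for this part. A minimal sketch, assuming the topicmodels package and the stem.reports corpus from 1d (the choice of five topics is arbitrary):
require(topicmodels)
#LDA expects a document-term matrix (documents as rows)
dtm <- DocumentTermMatrix(stem.reports, control = list(wordLengths = c(3, Inf)))
#fit a five-topic LDA model; k is a judgment call
topic.model <- LDA(dtm, k = 5)
#show the ten most probable terms for each topic
terms(topic.model, 10)
#show the most likely topic for each report
topics(topic.model)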

2.
Download KNIME and run the New York Times RSS feed analyzer. Note that you will need to add the community-contributed plugin, Palladian, to run the analyzer. See Preferences > Install/Update > Available Software Sites and the community contribution page.
3.
Merge the annual reports for Berkshire Hathaway (i.e., Buffett's letters) and UPS into a single corpus.
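No code is given for this exercise. A minimal sketch, assuming the Berkshire Hathaway letters have been read into a corpus named letters in the same way as the UPS reports in 1c (the source files for the letters are not specified here):
#combine the two corpora into one
combined <- c(reports, letters)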
3a.
Undertake a cluster analysis and identify which reports are similar in nature.
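A sketch that reuses the preprocessing of 1d and the clustering of 1g on the combined corpus:
#preprocess the combined corpus as in 1d
clean.combined <- tm_map(combined, content_transformer(tolower))
clean.combined <- tm_map(clean.combined, removePunctuation)
clean.combined <- tm_map(clean.combined, removeNumbers)
clean.combined <- tm_map(clean.combined, stripWhitespace)
clean.combined <- tm_map(clean.combined, removeWords, stopwords("SMART"))
#build the term-document matrix and drop sparse terms
tdm.combined <- TermDocumentMatrix(clean.combined, control = list(wordLengths = c(3, Inf)))
tdm2 <- removeSparseTerms(tdm.combined, 0.5)
#cluster the reports and plot the dendrogram as in 1g
cluster2 <- hclust(dist(as.matrix(t(tdm2))), method = 'centroid')
ggdendrogram(as.dendrogram(cluster2), rotate = TRUE)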

3b.
Build a topic model for the combined annual reports.
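Following the sketch in 1h, an LDA model can be fitted to the combined corpus (again assuming the topicmodels package):
dtm.combined <- DocumentTermMatrix(clean.combined, control = list(wordLengths = c(3, Inf)))
combined.model <- LDA(dtm.combined, k = 5)
#compare the dominant topics of the UPS reports and the Buffett letters
topics(combined.model)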

3c.
Do the cluster analysis and topic model suggest considerable differences in the two sets of reports?

This page is part of the promotional and support material for Data Management (open edition) by Richard T. Watson
For questions and comments please contact the author
Date revised: 10-Dec-2021