16 - Text Mining

Answers to exercises

1.
Take the recent annual reports for UPS and convert them to text using an online service, such as http://www.fileformat.info/convert/doc/pdf2txt.htm. Complete the following tasks:
1a.
Count the words in the most recent annual report.
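library(stringr)
# read the report as one large string
f <- readChar("http://dl.dropboxusercontent.com/u/53671029/UGAMIT/UPS_Annual_Reports/2012.txt", nchars=1e6)
# split the string into words at each space
y <- str_split(f, " ")
# report length of the vector
length(y[[1]])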
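Splitting on a single space counts the empty strings produced by consecutive spaces and does not separate words divided by newlines or tabs. The following is a minimal sketch in base R, assuming a short sample string (hypothetical, not taken from the report), that splits on runs of whitespace instead:

```r
# Hypothetical sample text standing in for the annual report
sample_text <- "the quick  brown fox\njumps over\tthe lazy dog"
# Split on one or more whitespace characters (spaces, tabs, newlines)
words <- strsplit(trimws(sample_text), "\\s+")[[1]]
# Drop any empty strings before counting
words <- words[nzchar(words)]
word_count <- length(words)
print(word_count)
```

The same pattern could be applied to `f` in place of `sample_text`; the count will usually be somewhat lower than the one obtained by splitting on a single space.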
1c.
Create a corpus.
#set up data frame to hold 5 UPS Annual Reports
df <-  data.frame(num=5)
begin <-  2008
i <-  begin
#read the Annual Reports
while (i < 2013) {   
  y <- as.character(i)
  #create the file name    
  f <- str_c('http://dl.dropboxusercontent.com/u/53671029/UGAMIT/UPS_Annual_Reports/',y,'.txt',sep='')
  #read the annual report as one large string
  d <-  readChar(f,nchars=1e6)
  #add annual report to the data frame
  df[i-begin+1,] <-  d   
  i <-  i + 1 
} 

#create the corpus
reports <-  Corpus(DataframeSource(as.data.frame(df), encoding = "UTF-8")) 
1e.
Create a term-document matrix and compute the frequency of words in the corpus.
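# clean.reports is assumed to be the reports corpus after cleaning; that step is not shown in these answers
tdm <-  TermDocumentMatrix(clean.reports,control = list(minWordLength=3))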
tdm.stem <- stemCompletion(rownames(tdm), dictionary=clean.reports, type=c("prevalent"))
rownames(tdm) <- as.vector(tdm.stem)
findFreqTerms(tdm, lowfreq = 500, highfreq = Inf)
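To make the frequency computation concrete, here is a minimal sketch in base R that builds a small term-by-document count matrix by hand and sums each row. The documents and the names `docs`, `tdm_toy`, and `term_freq` are hypothetical and stand in for the cleaned reports and the tm objects.

```r
# Hypothetical toy documents standing in for the cleaned reports
docs <- c(d2011 = "package delivery growth",
          d2012 = "package revenue growth growth")
# Tokenize each document on whitespace
tokens <- strsplit(docs, "\\s+")
# All distinct terms across the documents, sorted alphabetically
terms <- sort(unique(unlist(tokens)))
# Count each term in each document: rows are terms, columns are documents
tdm_toy <- vapply(tokens,
                  function(tk) as.integer(table(factor(tk, levels = terms))),
                  integer(length(terms)))
rownames(tdm_toy) <- terms
# Total frequency of each term across the corpus
term_freq <- sort(rowSums(tdm_toy), decreasing = TRUE)
print(term_freq)
```

With a tm term-document matrix, a comparable result should be available from `rowSums(as.matrix(tdm))`, although for a large corpus the dense matrix may be memory hungry.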
1g.
Undertake a cluster analysis, identify which reports are similar in nature, and see if you can explain why some reports are in different clusters.
require(ggplot2)
require(ggdendro)
#name the columns for the report's year
colnames(tdm) <-  2008:2012
#remove sparse terms
tdm1 <- removeSparseTerms(tdm, 0.5) 
#transpose the matrix
tdmtranspose <-  t(tdm1) 
cluster <- hclust(dist(tdmtranspose),method='centroid')
#get the clustering data
dend <-  as.dendrogram(cluster)
#plot the tree
ggdendrogram(dend,rotate=TRUE)
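Cluster membership can also be read off numerically rather than from the plot. The sketch below uses base R only, with a hypothetical matrix of term counts (rows are report years, columns are terms); the counts, the choice of average linkage, and the choice of two clusters are all assumptions made for illustration.

```r
# Hypothetical term counts: rows are reports (years), columns are terms
counts <- matrix(c(10,  2, 1,
                   11,  3, 1,
                    2,  9, 8,
                    1, 10, 9),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("2009", "2010", "2011", "2012"),
                                 c("fuel", "growth", "global")))
# Euclidean distances between reports
d <- dist(counts)
# Agglomerative clustering; average linkage is an assumption here
hc <- hclust(d, method = "average")
# Cut the tree into two groups and list each report's cluster
groups <- cutree(hc, k = 2)
print(groups)
```

Reports that share a cluster number use the sampled terms in similar proportions, which gives a starting point for explaining why some years separate from others.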
3.
Merge the annual reports for Berkshire Hathaway (i.e., Buffett's letters) and UPS into a single corpus.
3a.
Undertake a cluster analysis and identify which reports are similar in nature.

3c.
Do the cluster analysis and topic model suggest considerable differences in the two sets of reports?

This page is part of the promotional and support material for Data Management (sixth edition) by Richard T. Watson
For questions and comments please contact the author

Date revised: 19-Oct-2016