17 - Text Mining

Answers to exercises

1.
Take the recent annual reports for UPS and convert them to text using an online service, such as http://www.fileformat.info/convert/doc/pdf2txt.htm. Complete the following tasks:
1a.
Count the words in the most recent annual report.
f <- readChar("http://dl.dropboxusercontent.com/u/53671029/UGAMIT/UPS_Annual_Reports/2012.txt", nchars=1e6
y <-  str_split(f, " ")
# report length of the vector
length(y[[1]])
1b.
Compute the readability of the most recent annual report.
require(koRpus)
#tokenize the report (newer versions of koRpus also need the koRpus.lang.en language package)
check.text <- tokenize(f, format = 'obj', lang = 'en')
#score with the Flesch-Kincaid grade level
readability(check.text, 'Flesch.Kincaid', hyphen = NULL, force.lang = 'en')
1c.
Create a corpus.
require(tm)
require(stringr)
#set up a data frame to hold the five UPS annual reports
df <- data.frame(doc_id = as.character(2008:2012), text = character(5), stringsAsFactors = FALSE)
begin <- 2008
i <- begin
#read the annual reports
while (i < 2013) {
  y <- as.character(i)
  #create the file name
  f <- str_c('http://dl.dropboxusercontent.com/u/53671029/UGAMIT/UPS_Annual_Reports/', y, '.txt')
  #read the annual report as one large string
  d <- readChar(f, nchars = 1e6)
  #add the annual report to the data frame
  df[i - begin + 1, 'text'] <- d
  i <- i + 1
}

#create the corpus (recent versions of tm expect doc_id and text columns)
reports <- Corpus(DataframeSource(df))
1d.
Preprocess the corpus.
require(SnowballC) #Snowball has been superseded by SnowballC
require(RWeka)
require(rJava)
require(RWekajars)

#convert all letters to lower case (tolower is not a tm transformation, so wrap it)
clean.reports <- tm_map(reports, content_transformer(tolower))
#remove punctuation
clean.reports <- tm_map(clean.reports, removePunctuation)
#remove all numbers
clean.reports <- tm_map(clean.reports, removeNumbers)
#strip white space
clean.reports <- tm_map(clean.reports, stripWhitespace)
#stop word filter
clean.reports <- tm_map(clean.reports, removeWords, stopwords("SMART"))
#remove common words (lower case, since the corpus has already been lowercased)
dictionary <- c("ups", "united", "parcel", "million", "billion", "dollar")
clean.reports <- tm_map(clean.reports, removeWords, dictionary)
#stem words to their roots
stem.reports <- tm_map(clean.reports, stemDocument, language = "english")
1e.
Create a term-document matrix and compute the frequency of words in the corpus.
#build the term-document matrix from the stemmed corpus, keeping words of at least three letters
tdm <- TermDocumentMatrix(stem.reports, control = list(wordLengths = c(3, Inf)))
#complete stems back to the most prevalent full word in the cleaned corpus
tdm.stem <- stemCompletion(rownames(tdm), dictionary = clean.reports, type = "prevalent")
rownames(tdm) <- as.vector(tdm.stem)
#list words occurring at least 500 times
findFreqTerms(tdm, lowfreq = 500, highfreq = Inf)
1f.
Construct a word cloud for the 25 most common words.
#convert the term-document matrix to a regular matrix to get word frequencies
m <- as.matrix(tdm)
#sort terms by frequency, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
#get the words corresponding to the frequencies
names <- names(v)
#create a data frame for plotting
d <- data.frame(word = names, freq = v)
require(wordcloud)
require(RColorBrewer)
#select the color palette
pal <- brewer.pal(5, "BuGn")
#generate the cloud from the 25 most frequent words
wordcloud(d$word, d$freq, min.freq = d$freq[25], colors = pal)
1g.
Undertake a cluster analysis, identify which reports are similar in nature, and see if you can explain why some reports are in different clusters.
require(ggplot2)
require(ggdendro)
#name the columns with the report's year
colnames(tdm) <- as.character(2008:2012)
#remove sparse terms
tdm1 <- removeSparseTerms(tdm, 0.5)
#transpose the matrix so that each report is a row
tdmtranspose <- t(tdm1)
#cluster the reports on the distance between their term frequencies
cluster <- hclust(dist(as.matrix(tdmtranspose)), method = 'centroid')
#get the clustering data
dend <- as.dendrogram(cluster)
#plot the tree
ggdendrogram(dend, rotate = TRUE)
1h.
Build a topic model for the annual reports.
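The original answer omits code for this part. A minimal sketch, assuming the topicmodels package and the stem.reports corpus from 1d (the choice of five topics is arbitrary):
require(topicmodels)
#LDA expects a document-term matrix (documents as rows)
dtm <- DocumentTermMatrix(stem.reports, control = list(wordLengths = c(3, Inf)))
#fit a five-topic LDA model; k is a judgment call
topic.model <- LDA(dtm, k = 5)
#show the ten most probable terms for each topic
terms(topic.model, 10)
#show the most likely topic for each report
topics(topic.model)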

2.
Download KNIME and run the New York Times RSS feed analyzer. Note that you will need to add the community-contributed plugin, Palladian, to run the analyzer. See Preferences > Install/Update > Available Software Sites and the community contribution page.
3.
Merge the annual reports for Berkshire Hathaway (i.e., Buffett's letters) and UPS into a single corpus.
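No code is given for this exercise. A minimal sketch, assuming the Berkshire Hathaway letters have been read into a corpus named letters in the same way as the UPS reports in 1c (the source files for the letters are not specified here):
#combine the two corpora into one
combined <- c(reports, letters)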
3a.
Undertake a cluster analysis and identify which reports are similar in nature.
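A sketch that reuses the preprocessing of 1d and the clustering of 1g on the combined corpus:
#preprocess the combined corpus as in 1d
clean.combined <- tm_map(combined, content_transformer(tolower))
clean.combined <- tm_map(clean.combined, removePunctuation)
clean.combined <- tm_map(clean.combined, removeNumbers)
clean.combined <- tm_map(clean.combined, stripWhitespace)
clean.combined <- tm_map(clean.combined, removeWords, stopwords("SMART"))
#build the term-document matrix and drop sparse terms
tdm.combined <- TermDocumentMatrix(clean.combined, control = list(wordLengths = c(3, Inf)))
tdm2 <- removeSparseTerms(tdm.combined, 0.5)
#cluster the reports and plot the dendrogram as in 1g
cluster2 <- hclust(dist(as.matrix(t(tdm2))), method = 'centroid')
ggdendrogram(as.dendrogram(cluster2), rotate = TRUE)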

3b.
Build a topic model for the combined annual reports.
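Following the sketch in 1h, an LDA model can be fitted to the combined corpus (again assuming the topicmodels package):
dtm.combined <- DocumentTermMatrix(clean.combined, control = list(wordLengths = c(3, Inf)))
combined.model <- LDA(dtm.combined, k = 5)
#compare the dominant topics of the UPS reports and the Buffett letters
topics(combined.model)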

3c.
Do the cluster analysis and topic model suggest considerable differences in the two sets of reports?

This page is part of the promotional and support material for Data Management (open edition) by Richard T. Watson
For questions and comments please contact the author
Date revised: 10-Dec-2021