Text Mining

Posted by Lisa on November 5, 2015

Data

The unstructured text data set is called “radiology”. Let’s read it in first.

library(tm)
#read the data; stringsAsFactors = FALSE keeps the text as character strings instead of factors
radiology <- read.csv("radiology_results.csv", header = TRUE, stringsAsFactors = FALSE)
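
A quick look at the imported data confirms the free-text field we are going to mine (this assumes the RESULT column used later on):

#quick sanity check of the data frame and its free-text column
str(radiology)
head(radiology$RESULT, 1)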

Term frequency

Let’s apply a simple text analysis to find how frequently each important word is used. We are going to follow these steps:

  1. Set up a source for your text.
  2. Create a corpus from that source (a corpus is just another name for a collection of texts).
  3. Create a document-term matrix, which tells you how frequently each term appears in each document in your corpus.
#only keep the first 1000 observations
radiology <- radiology[1:1000, ]
#simply paste every result together, separated by a space
radiology_result <- paste(radiology$RESULT, collapse = " ")

radiology_source <- VectorSource(radiology_result)
corpus <- Corpus(radiology_source)
#transform every word to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
#remove punctuation
corpus <- tm_map(corpus, removePunctuation)
#strip out any extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
#have a look at the built-in English stopword list
stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
#remove the stopwords from the corpus
corpus <- tm_map(corpus, removeWords, stopwords("english"))

#create the document-term matrix
dtm <- DocumentTermMatrix(corpus)
#convert the document-term matrix into a regular matrix
dtm2 <- as.matrix(dtm)
#sum each column to get the total frequency of each term
frequency <- colSums(dtm2)
#sort this vector to see the most frequently used words
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
##    right     left    chest   normal findings     seen 
##      733      664      653      631      566      465

Hmm. This seems rather useless, because we are not interested in words like “right” and “left”. I’ll do a more useful text analysis in “text mining2”. To be continued…
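
In the meantime, one quick fix is to treat such uninformative terms as extra stopwords. A minimal sketch (the word list below is only an illustration, not a curated radiology stopword list):

#drop a few domain-specific words that carry little information (illustrative list only)
my_stopwords <- c("right", "left", "seen")
corpus2 <- tm_map(corpus, removeWords, my_stopwords)
dtm_clean <- DocumentTermMatrix(corpus2)
frequency_clean <- sort(colSums(as.matrix(dtm_clean)), decreasing = TRUE)
head(frequency_clean)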

Plot these frequencies as a word cloud

Just for fun!

#Plotting a word cloud
#install.packages('wordcloud')
library(wordcloud)
## Loading required package: RColorBrewer
words <- names(frequency)
wordcloud(words[1:100], frequency[1:100])
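
The defaults look a bit plain; wordcloud() also takes a minimum frequency, ordering option and a colour palette, for example (the palette choice here is just a suggestion):

#a slightly fancier word cloud using an RColorBrewer palette
wordcloud(words[1:100], frequency[1:100],
          min.freq = 10, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))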