Text Mining
Lisa
November 17, 2015
Data
The unstructured text data set is called “radiology”. Let’s read it in first
library(tm)
radiology=read.csv("radiology_results.csv",header = TRUE, stringsAsFactors=FALSE) #keep the text as character strings, not factors
Term frequency
Let’s apply a simple text analysis to find the frequency of each important word. We are going to do the following steps:
- Set up a source for your text.
- Create a corpus from that source (a corpus is just another name for a collection of texts).
- Create a document-term matrix, which tells you how frequently each term appears in each document in your corpus.
#only take the first 1000 observations
radiology=radiology[1:1000,]
#simply paste every report together, separated by a space
radiology_result <- paste(radiology$RESULT , collapse=" ")
radiology_source <- VectorSource(radiology_result)
corpus <- Corpus(radiology_source)
#transform every word to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
#remove punctuation
corpus <- tm_map(corpus, removePunctuation)
#strip out any extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
#remove stopwords
stopwords("english")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#create the document-term matrix
dtm <- DocumentTermMatrix(corpus)
#convert our document-term matrix into a normal matrix
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
#sort this vector to see the most frequently used words
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
## right left chest normal findings seen
## 733 664 653 631 566 465
Hmm, this seems rather useless, because we are not interested in words like “right” and “left”. I’ll do a more useful text analysis in “text mining2”. To be continued…
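In the meantime, one quick fix is to pass a second, custom stopword list to removeWords for domain words like “right” and “left”. A minimal sketch with a toy two-report corpus (the example documents and the custom word list are made up for illustration; the real data comes from the CSV above):

```r
library(tm)

#a tiny toy corpus standing in for the radiology reports
docs <- c("normal chest no acute findings right lung clear",
          "left lower lobe opacity seen chest unchanged")
corpus <- Corpus(VectorSource(paste(docs, collapse = " ")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

#drop domain words that carry little meaning in radiology reports
my_stopwords <- c("right", "left", "seen", "normal", "findings")
corpus <- tm_map(corpus, removeWords, my_stopwords)

dtm <- DocumentTermMatrix(corpus)
frequency <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(frequency)
```

The same removeWords call can be applied to the real corpus before building the document-term matrix, so the frequency table is no longer dominated by laterality terms.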
Plot these frequencies as a word cloud
Just for fun!
#Plotting a word cloud
#install.packages('wordcloud')
library(wordcloud)
## Loading required package: RColorBrewer
words <- names(frequency)
wordcloud(words[1:100], frequency[1:100])
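The default cloud can be tuned a bit. A sketch of some common wordcloud() options, shown here on a made-up toy frequency vector (substitute the words and frequency vectors computed above):

```r
library(wordcloud)
library(RColorBrewer)

#toy frequency vector standing in for the real one
frequency <- c(chest = 733, lung = 664, opacity = 653,
               acute = 631, effusion = 566, pleural = 465)
words <- names(frequency)

set.seed(42)  #word placement is random; fix the seed for a reproducible layout
wordcloud(words, frequency,
          scale = c(4, 0.5),       #largest and smallest font sizes
          random.order = FALSE,    #plot the most frequent words in the centre
          colors = brewer.pal(8, "Dark2"))
```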