Thursday, November 05, 2015

Before we get started …

  • Type the following strings:
text <- c("This is me.", "That is her.", "this's it!", 
          "Oh my gosh.", "What is it?")
  • You might need to consult a regular expression cheatsheet.

Recap with quick questions

Question 1

Search for sentences that start with the letter 't', in either uppercase or lowercase.

Answer 1

grep('^t', text, ignore.case=T, value=T)
## [1] "This is me."  "That is her." "this's it!"

Question 2

What changes if the 'value' argument is set to 'F'?

Answer 2

grep('^t', text, ignore.case=T, value=T)
## [1] "This is me."  "That is her." "this's it!"
grep('^t', text, ignore.case=T, value=F)
## [1] 1 2 3
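The indices returned with value=F can be used to subset the vector, which gives the same strings as value=T. A minimal sketch follows; the names idx and hit are introduced here for illustration only, and grepl() is shown as a related base R function that returns a logical vector instead.

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# positions of the matching elements (the default, value = FALSE)
idx <- grep('^t', text, ignore.case = TRUE)

# subsetting by position gives the same strings as value = TRUE
text[idx]

# grepl() returns one TRUE/FALSE per element, handy for counting or filtering
hit <- grepl('^t', text, ignore.case = TRUE)
sum(hit)
```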

Question 3

What does the following code do?

grep('t.i', text, ignore.case=T, value=T)
## [1] "This is me."  "That is her." "this's it!"   "What is it?"

Answer 3

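# find strings containing 't', then any single character, then 'i'
# (the '.' matches any character; ignore.case=T also allows 'T' and 'I')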
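To see which part of each string the pattern actually matched, one option is regmatches() together with regexpr(). This is a sketch for inspection only; the name m is hypothetical, and regexpr() reports just the first match in each string.

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# locate the first match of 't.i' in each string, ignoring case
m <- regexpr('t.i', text, ignore.case = TRUE)

# extract the matched substrings; strings without a match are dropped
matched <- regmatches(text, m)
matched
```

Note that the '.' matches a letter in some strings and a space in others.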

Question 4

What does the following code do?

grep('!$', text, value=T)
## [1] "this's it!"

Answer 4

# find the strings that end with '!'

Question 5

How do you search for the sentences that end with '!' or '?'?

Answer 5

grep('(!|\\?)$', text, value=T)
## [1] "this's it!"  "What is it?"
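An alternative sketch uses a character class instead of alternation. Inside square brackets '?' is a literal character, so no escaping is needed. The name ends_punct is introduced here for illustration.

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# [!?] matches either '!' or '?'; '$' anchors the match at the end
ends_punct <- grep('[!?]$', text, value = TRUE)
ends_punct
```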

Question 6

How can you clean up a sentence that contains many mistyped extra spaces?

s <- 'I am a     student.'

Answer 6

gsub(' +', ' ', s)
## [1] "I am a student."
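If the stray whitespace may also include tabs, or sit at the start or end of the sentence, a slightly more general sketch is possible. The string s2 is a made-up example, and trimws() assumes R 3.2.0 or later.

```r
# hypothetical input with leading, trailing and repeated whitespace
s2 <- '  I am a     student.  '

# collapse every run of whitespace (spaces, tabs, ...) into one space
collapsed <- gsub('[[:space:]]+', ' ', s2)

# then strip whatever is left at both ends
cleaned <- trimws(collapsed)
cleaned
```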

Question 7

We can automatically insert a word-boundary tag into the text.

gsub('\\b', '<WB>', text)
## [1] "<WB>T<WB>h<WB>i<WB>s<WB> <WB>i<WB>s<WB> <WB>m<WB>e<WB>.<WB>"     
## [2] "<WB>T<WB>h<WB>a<WB>t<WB> <WB>i<WB>s<WB> <WB>h<WB>e<WB>r<WB>.<WB>"
## [3] "<WB>t<WB>h<WB>i<WB>s<WB>'<WB>s<WB> <WB>i<WB>t<WB>!<WB>"          
## [4] "<WB>O<WB>h<WB> <WB>m<WB>y<WB> <WB>g<WB>o<WB>s<WB>h<WB>.<WB>"     
## [5] "<WB>W<WB>h<WB>a<WB>t<WB> <WB>i<WB>s<WB> <WB>i<WB>t<WB>?<WB>"
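In the output above the tag appears between every character rather than only at word boundaries. This seems to come from how the default regex engine handles the zero-length match '\\b' in gsub(). A sketch that assumes the Perl-compatible engine (perl=TRUE) places the tag only at the edges of words; the name tagged is hypothetical.

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# perl = TRUE switches gsub() to the PCRE engine for this call
tagged <- gsub('\\b', '<WB>', text, perl = TRUE)
tagged[1]
```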

How do you tag each sentence in text with <s> at the beginning and </s> at the end?

Answer 7

t1 <- gsub('^', '<s>', text)
gsub('$', '</s>', t1)
## [1] "<s>This is me.</s>"  "<s>That is her.</s>" "<s>this's it!</s>"  
## [4] "<s>Oh my gosh.</s>"  "<s>What is it?</s>"
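The same tagging can also be done in a single step. Two sketches are shown: one with a capture group and a backreference, one with plain string concatenation that needs no regex at all. The names tagged_sub and tagged_paste are introduced here for illustration.

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# capture the whole string and wrap it; \\1 refers to the captured group
tagged_sub <- sub('^(.*)$', '<s>\\1</s>', text)

# paste0() is vectorised, so every element gets both tags
tagged_paste <- paste0('<s>', text, '</s>')
tagged_paste
```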

Exercises

  • [a] Count the frequencies of Alice, rabbit, cat and caterpillar within the text. (Note: account for both uppercase and lowercase.)
  • [b] Use a bar plot to present the above frequencies. (Note: remember to label the x-axis, and set the upper limit of the y-axis to 400.)
  • [c] Many words are written entirely in capital letters (e.g. IS, SHE, THINK …). Find these words and count how often each one occurs. (Hint: you might need to run this first: unlist(strsplit(f, split=' ')). Remember to remove irrelevant symbols before counting the frequencies.)
  • [d] Describe ways to find most of the story's characters within the text.

Answers

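# [a] Count the elements of f that contain each word.
# Note: grep() counts matching elements, not every single occurrence.
alice <- length(grep('Alice', f))
rabbit <- length(grep('[Rr]abbit', f))
# \\b on each side matches 'cat' only as a whole word (not 'caterpillar')
cat <- length(grep('\\b[Cc]at\\b', f))
caterpillar <- length(grep('[Cc]aterpillar', f))

# [b] Bar plot with labelled axes and the y-axis running from 0 to 400
barplot(c(alice=alice, rabbit=rabbit, cat=cat, 
          caterpillar=caterpillar), 
        xlab='Word', ylab='Frequency', ylim=c(0, 400))

# [c] Split on spaces, keep tokens with two or more consecutive capitals,
# strip punctuation, then count each word
split <- unlist(strsplit(f, split=' '))
capital <- grep('[A-Z]{2,}', split, value=T)
cap <- gsub(",|\\.|\\:|\\?|\\'|\\!|\\[|\\(|\\)|\\;", "", capital)
cap <- gsub('\\"', '', cap)
table(cap)

# [d] Characters can be extracted via 
# the pattern: said the NAME or said NAME.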
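The 'said the NAME' / 'said NAME' idea can be sketched as follows. The vector sample_lines is made-up data standing in for the story text, and the names hits and names_found are hypothetical. This is one possible approach under those assumptions, not a complete character finder.

```r
# hypothetical sample lines standing in for the story text
sample_lines <- c("'Off with her head!' said the Queen.",
                  "'I do not know,' said Alice.",
                  "'Who are you?' said the Caterpillar, and Alice said nothing.")

# 'said', an optional 'the', then a capitalised word
pattern <- 'said (the )?[[:upper:]][[:lower:]]+'

# gregexpr() finds every match per line; regmatches() extracts them
hits <- unlist(regmatches(sample_lines, gregexpr(pattern, sample_lines)))

# drop the leading 'said ' / 'said the ' to keep only the name
names_found <- sub('^said (the )?', '', hits)

# frequency of each candidate name
table(names_found)
```

Names that never follow 'said' are missed by this pattern, so it only recovers most of the characters rather than all of them.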