Week 8 Lab - Regular Expressions

Thursday, November 05, 2015

Before we get started …

Type the following strings:

text <- c("This is me.", "That is her.", "this's it!", 
          "Oh my gosh.", "What is it?")

You might need to consult a cheat sheet for regular expressions.

Recap with quick questions

Question 1

Find the sentences that start with the letter 't', in either uppercase or lowercase.

Answer 1

grep('^t', text, ignore.case=T, value=T)

## [1] "This is me."  "That is her." "this's it!"

Question 2

What changes if 'value' is set to 'F' (FALSE)?

Answer 2

grep('^t', text, ignore.case=T, value=T)

## [1] "This is me."  "That is her." "this's it!"

# value=F returns the indices of the matching elements instead of the strings
grep('^t', text, ignore.case=T, value=F)

## [1] 1 2 3
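
When only a yes/no answer per element is needed, the related base R function grepl() returns a logical vector rather than indices or strings. A minimal sketch, assuming the same text vector as above (the name starts_with_t is only illustrative):

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# grepl() returns TRUE/FALSE for every element of the input
starts_with_t <- grepl('^t', text, ignore.case = TRUE)
print(starts_with_t)
## [1]  TRUE  TRUE  TRUE FALSE FALSE

# Subsetting with the logical vector gives the same strings as value = TRUE
print(text[starts_with_t])
```

Unlike the index vector, the logical vector always has the same length as the input, which makes it convenient for filtering.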

Question 3

What does the following code do?

grep('t.i', text, ignore.case=T, value=T)

## [1] "This is me."  "That is her." "this's it!"   "What is it?"

Answer 3

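# Finds the strings that contain 't', then any single character, then 'i'
# (case-insensitive); the '.' matches any character, including a space.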
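
To see which part of each string the pattern actually matched, one option is to combine regexpr() with regmatches(). This is a sketch under the assumption that only the first match per string is of interest; the name first_match is illustrative:

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# regexpr() locates the first match in each element;
# regmatches() extracts the matched text and drops elements without a match
first_match <- regmatches(text, regexpr('t.i', text, ignore.case = TRUE))
print(first_match)
## [1] "Thi" "t i" "thi" "t i"
```

The "t i" results show the '.' matching the space between "That" (or "What") and "is".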

Question 4

What does the following code do?

grep('!$', text, value=T)

## [1] "this's it!"

Answer 4

# Finds the strings that end with '!' ('$' anchors the match at the end)

Question 5

How do you find the sentences that end with '!' or '?'?

Answer 5

grep('(!|\\?)$', text, value=T)

## [1] "this's it!"  "What is it?"
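
An equivalent way to write this pattern is a character class. Inside square brackets '?' is a literal character, so no escaping is needed. A minimal sketch, assuming the same text vector (ends_marked is an illustrative name):

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# [!?] matches a single '!' or '?'; '$' anchors it at the end of the string
ends_marked <- grep('[!?]$', text, value = TRUE)
print(ends_marked)
## [1] "this's it!"  "What is it?"
```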

Question 6

How do you collapse runs of mistyped extra spaces in a sentence into a single space?

s <- 'I am a     student.'

Answer 6

gsub(' +', ' ', s)

## [1] "I am a student."
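
If the stray whitespace may also include tabs or sit at the start or end of the string, a slightly more general variant uses '\\s+' together with trimws(). This is a sketch on a made-up string s2, not part of the original exercise:

```r
# Hypothetical input with leading/trailing spaces and a tab
s2 <- '  I  am\ta   student.  '

# '\\s+' matches one or more whitespace characters (spaces, tabs, ...);
# trimws() then removes whitespace left at both ends
cleaned <- trimws(gsub('\\s+', ' ', s2))
print(cleaned)
## [1] "I am a student."
```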

Question 7

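We can use '\\b', which matches at a word boundary, to insert a <WB> tag into text. Note that in the output below, produced with gsub()'s default regex engine, the tag lands between every pair of characters rather than only at word boundaries.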

gsub('\\b', '<WB>', text)

## [1] "<WB>T<WB>h<WB>i<WB>s<WB> <WB>i<WB>s<WB> <WB>m<WB>e<WB>.<WB>"     
## [2] "<WB>T<WB>h<WB>a<WB>t<WB> <WB>i<WB>s<WB> <WB>h<WB>e<WB>r<WB>.<WB>"
## [3] "<WB>t<WB>h<WB>i<WB>s<WB>'<WB>s<WB> <WB>i<WB>t<WB>!<WB>"          
## [4] "<WB>O<WB>h<WB> <WB>m<WB>y<WB> <WB>g<WB>o<WB>s<WB>h<WB>.<WB>"     
## [5] "<WB>W<WB>h<WB>a<WB>t<WB> <WB>i<WB>s<WB> <WB>i<WB>t<WB>?<WB>"
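
To place the tag only at real word boundaries, one commonly suggested option is to switch gsub() to the PCRE engine with perl = TRUE. The following is a sketch under that assumption, using the same text vector (tagged is an illustrative name):

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# perl = TRUE uses the PCRE engine, which handles the zero-length
# match of '\\b' at word boundaries only
tagged <- gsub('\\b', '<WB>', text, perl = TRUE)
print(tagged[1])
## [1] "<WB>This<WB> <WB>is<WB> <WB>me<WB>."
```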

How do you tag each sentence in text with <s> at the beginning and </s> at the end?

Answer 7

t1 <- gsub('^', '<s>', text)
gsub('$', '</s>', t1)

## [1] "<s>This is me.</s>"  "<s>That is her.</s>" "<s>this's it!</s>"  
## [4] "<s>Oh my gosh.</s>"  "<s>What is it?</s>"
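
The same tagging can also be done in one step. Two possible alternatives, sketched here with illustrative variable names, are paste0() and a single sub() call with a backreference:

```r
text <- c("This is me.", "That is her.", "this's it!",
          "Oh my gosh.", "What is it?")

# Option 1: no regex at all, just string concatenation
tagged_paste <- paste0('<s>', text, '</s>')

# Option 2: capture the whole string and wrap it; '\\1' is the captured group
tagged_sub <- sub('^(.*)$', '<s>\\1</s>', text)

print(tagged_paste[1])
## [1] "<s>This is me.</s>"
print(identical(tagged_paste, tagged_sub))
## [1] TRUE
```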

Exercises

Before we get started …

Load Alice's Adventures in Wonderland from the link below: [http://www.gutenberg.org/cache/epub/11/pg11.txt]

f <- readLines('http://www.gutenberg.org/cache/epub/11/pg11.txt')

Exercises

[a] Count the frequencies of Alice, rabbit, cat and caterpillar within the text. (Note: pay attention to uppercase and lowercase.)
[b] Use a bar plot to present the above frequencies. (Note: remember to label the x-axis, and set the range of the y-axis to 0-400.)
[c] Many words are written entirely in capital letters (e.g. IS, SHE, THINK …). Find these words and count how often each one occurs. (Hint: you might need to run this first: unlist(strsplit(f, split=' ')). Remember to strip irrelevant symbols before counting the frequencies.)
[d] Describe ways to find most of the story's characters within the text.

Answers

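# [a] Count the lines that mention each word
# (a line with several mentions is counted once)
alice <- length(grep('Alice', f))
rabbit <- length(grep('[Rr]abbit', f))
# '\\b' on both sides keeps 'cat' from matching inside longer words
cat <- length(grep('\\b[Cc]at\\b', f))
caterpillar <- length(grep('[Cc]aterpillar', f))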
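
The grep() approach counts matching lines. If every individual occurrence should be counted, including all-uppercase spellings, one possible alternative is gregexpr() with regmatches(). The sketch below uses a few made-up lines instead of the downloaded book, and count_occurrences() is a hypothetical helper, not a base R function:

```r
# Hypothetical sample lines standing in for the book text
sample_lines <- c('Alice saw the Cat.',
                  'The cat grinned; Alice and ALICE again.',
                  'Nobody was there.')

# Hypothetical helper: total number of case-insensitive matches across all lines
count_occurrences <- function(pattern, x) {
  m <- gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  # regmatches() returns one character vector of matches per line
  sum(lengths(regmatches(x, m)))
}

n_alice <- count_occurrences('alice', sample_lines)      # 3 occurrences
n_cat   <- count_occurrences('\\bcat\\b', sample_lines)  # 2 occurrences
print(c(alice = n_alice, cat = n_cat))

# For comparison: only 2 of the lines contain 'alice'
print(length(grep('alice', sample_lines, ignore.case = TRUE)))
```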

# [b] Bar plot of the counts; the vector names label the bars
barplot(c(alice=alice, rabbit=rabbit, cat=cat, 
          caterpillar=caterpillar), xlab='Word', ylim=c(0, 400))

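# [c] Split the lines into tokens on spaces
split <- unlist(strsplit(f, split=' '))
# Keep tokens that contain at least two consecutive capital letters
capital <- grep('[A-Z]{2,}', split, value=T)
# Strip punctuation and quotes, then count each remaining token
cap <- gsub(",|\\.|\\:|\\?|\\'|\\!|\\[|\\(|\\)|\\;", "", capital)
cap <- gsub('\\"', '', cap)
table(cap)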
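
A variant of this step strips punctuation first and then keeps only tokens made up entirely of capital letters, so that mixed-case tokens are not counted. This is a sketch on made-up lines, with [[:punct:]] assumed to be an acceptable definition of "irrelevant symbols":

```r
# Hypothetical sample lines standing in for the book text
sample_lines <- c('"I THINK so," said Alice.',
                  'SHE said it IS, and SHE meant it.')

# Split on runs of spaces, then remove all punctuation characters
tokens <- unlist(strsplit(sample_lines, ' +'))
tokens <- gsub('[[:punct:]]', '', tokens)

# Keep tokens consisting only of two or more capital letters
caps <- grep('^[A-Z]{2,}$', tokens, value = TRUE, perl = TRUE)

# IS occurs once, SHE twice, THINK once
cap_counts <- table(caps)
print(cap_counts)
```

Requiring at least two letters leaves out the pronoun "I", which is capitalised everywhere and would otherwise dominate the counts.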

# [d] Characters can be found by searching for
# the patterns 'said the NAME' or 'said NAME'.
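
The 'said the NAME' idea can be sketched as follows. The sample lines are made up, and the pattern assumes that a name is a single capitalised word, which will miss multi-word names such as "Mock Turtle":

```r
# Hypothetical sample lines standing in for the book text
sample_lines <- c("'Who are you?' said the Caterpillar.",
                  "'I can't explain,' said Alice.",
                  "said the Hatter, and said the Caterpillar again.")

# Find every 'said NAME' or 'said the NAME' phrase
m <- gregexpr('said (the )?[A-Z][a-z]+', sample_lines, perl = TRUE)
phrases <- unlist(regmatches(sample_lines, m))

# Drop the leading 'said ' / 'said the ' to leave only the name
who <- sub('^said (the )?', '', phrases)

# Names that follow 'said' most often are likely story characters
who_counts <- sort(table(who), decreasing = TRUE)
print(who_counts)
```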