- Type the following strings:
text <- c("This is me.", "That is her.", "this's it!", "Oh my gosh.", "What is it?")
- You might need to consult a regular expression cheatsheet.
Thursday, November 05, 2015
text <- c("This is me.", "That is her.", "this's it!", "Oh my gosh.", "What is it?")
Search for sentences that start with the letter 't', in either uppercase or lowercase.
grep('^t', text, ignore.case=T, value=T)
## [1] "This is me." "That is her." "this's it!"
What is the difference if 'value' is set to 'F'?
grep('^t', text, ignore.case=T, value=T)
## [1] "This is me." "That is her." "this's it!"
grep('^t', text, ignore.case=T, value=F)
## [1] 1 2 3
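The two calls above relate in a simple way, sketched below: with 'value' set to 'F', grep returns the positions of the matching elements, so indexing text with those positions gives the same strings that 'value=T' returns. The names idx, matched and hits are only illustrative.

```r
text <- c("This is me.", "That is her.", "this's it!", "Oh my gosh.", "What is it?")

# Positions of the matching elements (value = FALSE is the default)
idx <- grep('^t', text, ignore.case = TRUE)

# Indexing with those positions recovers the matching strings
matched <- text[idx]

# grepl() returns one TRUE/FALSE per element instead of positions
hits <- grepl('^t', text, ignore.case = TRUE)
```

Positions are handy for subsetting other vectors of the same length, while the logical vector from grepl is convenient for filtering or counting.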
What does the following code do?
grep('t.i', text, ignore.case=T, value=T)
## [1] "This is me." "That is her." "this's it!" "What is it?"
# matches 't', followed by any single character, followed by 'i'
What does the following code do?
grep('!$', text, value=T)
## [1] "this's it!"
# find the strings that end with '!'
How do you search for sentences that end with '!' or '?'?
grep('(!|\\?)$', text, value=T)
## [1] "this's it!" "What is it?"
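An equivalent way to write this, sketched here as an alternative rather than a correction, is a character class. Inside square brackets '?' is treated literally, so it needs no escaping. The name ends is only illustrative.

```r
text <- c("This is me.", "That is her.", "this's it!", "Oh my gosh.", "What is it?")

# [!?] matches a single '!' or '?'; '$' anchors it to the end of the string
ends <- grep('[!?]$', text, value = TRUE)
```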
What can you do if a sentence contains a lot of mistyped extra spaces?
s <- 'I  am   a    student.'
gsub(' +', ' ', s)
## [1] "I am a student."
We can automatically insert a word-boundary tag into the text.
gsub('\\b', '<WB>', text)
## [1] "<WB>T<WB>h<WB>i<WB>s<WB> <WB>i<WB>s<WB> <WB>m<WB>e<WB>.<WB>"
## [2] "<WB>T<WB>h<WB>a<WB>t<WB> <WB>i<WB>s<WB> <WB>h<WB>e<WB>r<WB>.<WB>"
## [3] "<WB>t<WB>h<WB>i<WB>s<WB>'<WB>s<WB> <WB>i<WB>t<WB>!<WB>"
## [4] "<WB>O<WB>h<WB> <WB>m<WB>y<WB> <WB>g<WB>o<WB>s<WB>h<WB>.<WB>"
## [5] "<WB>W<WB>h<WB>a<WB>t<WB> <WB>i<WB>s<WB> <WB>i<WB>t<WB>?<WB>"
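In the output above the tag lands between every character, not only at word boundaries. This appears to come from how R's default regex engine handles zero-length matches such as '\\b' in gsub. A minimal sketch of one common workaround, assuming Perl-compatible matching is acceptable here, is to pass perl = TRUE. The name tagged is only illustrative.

```r
text <- c("This is me.", "That is her.", "this's it!", "Oh my gosh.", "What is it?")

# perl = TRUE switches gsub() to the PCRE engine, which places the
# zero-length '\\b' match only at the start and end of each word
tagged <- gsub('\\b', '<WB>', text, perl = TRUE)
print(tagged)
```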
How do you tag each sentence in text with <s> at the beginning and </s> at the end?
t1 <- gsub('^', '<s>', text)
gsub('$', '</s>', t1)
## [1] "<s>This is me.</s>" "<s>That is her.</s>" "<s>this's it!</s>"
## [4] "<s>Oh my gosh.</s>" "<s>What is it?</s>"
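The same tagging can also be done in a single step. The sketch below shows two alternatives: sub with a capture group and a backreference, and paste0 with no regular expression at all. The names tagged and tagged2 are only illustrative.

```r
text <- c("This is me.", "That is her.", "this's it!", "Oh my gosh.", "What is it?")

# Capture the whole string and wrap it; '\\1' refers to the captured group
tagged <- sub('^(.*)$', '<s>\\1</s>', text)

# paste0() concatenates the tags directly, element by element
tagged2 <- paste0('<s>', text, '</s>')
```

The regex version generalises to cases where only part of the string should be wrapped, while paste0 is the simpler choice when the whole string is tagged.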
f <- readLines('http://www.gutenberg.org/cache/epub/11/pg11.txt')
# count the lines that mention each character, then plot the counts
alice <- length(grep('Alice', f, value=T))
rabbit <- length(grep('[Rr]abbit', f, value=T))
cat <- length(grep('[Cc]at\\b', f, value=T))
caterpillar <- length(grep('[Cc]aterpillar', f, value=T))
barplot(c(alice=alice, rabbit=rabbit, cat=cat, caterpillar=caterpillar), ylim=c(0, 400))
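Note that length(grep(...)) counts matching lines, so a line that mentions a name twice is counted once. If every occurrence should be counted, one possible approach is gregexpr together with regmatches. The sketch below uses a small made-up vector, sample_lines, in place of the downloaded text, and the names n_lines and n_hits are only illustrative.

```r
# Hypothetical sample lines standing in for the downloaded text
sample_lines <- c("Alice saw the Rabbit, and the rabbit saw Alice.",
                  "The Cat grinned.",
                  "Nothing here.")

# grep() reports matching lines: a line with two hits counts once
n_lines <- length(grep('[Rr]abbit', sample_lines))

# gregexpr() locates every match; regmatches() extracts them per line
n_hits <- length(unlist(regmatches(sample_lines,
                                   gregexpr('[Rr]abbit', sample_lines))))
```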
split <- unlist(strsplit(f, split=' '))
capital <- grep('[A-Z]{2,}', split, value=T)
cap <- gsub(",|\\.|\\:|\\?|\\'|\\!|\\[|\\(|\\)|\\;", "", capital)
cap <- gsub('\\"', '', cap)
table(cap)
# characters can be extracted via
# the pattern: said the NAME or said NAME.
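The 'said the NAME or said NAME' idea mentioned in the comment could be sketched as follows. This is one possible reading of that pattern, not the original solution: it assumes a name is a single capitalised word, and it runs on a small made-up vector, sample_lines, in place of the downloaded text. The names pattern, hits and names_found are only illustrative.

```r
# Hypothetical sample lines standing in for the downloaded text
sample_lines <- c("'I wonder,' said Alice.",
                  "'Come back!' said the Caterpillar.",
                  "The Queen said nothing.")

# 'said', an optional 'the ', then one capitalised word
pattern <- 'said (the )?[A-Z][a-z]+'

# regexpr() finds the first match per line; regmatches() keeps only the hits
hits <- regmatches(sample_lines, regexpr(pattern, sample_lines))

# Drop the leading 'said ' or 'said the ' so that only the name remains
names_found <- sub('^said (the )?', '', hits)
table(names_found)
```

Lines without a capitalised word after 'said' are skipped, so the third sample line contributes nothing to the table.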