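I am now removing stopwords from all parts of the analysis.
For each string, the stopwords are removed and the corpus is searched for the words that follow the string. The leftmost word is then dropped and the search is repeated until no words remain. A frequency table of at most 10 possible next words is printed for every length that matches.
This data will eventually be used to predict the next word in an arbitrary sentence.
See the Appendix for code.
These are the stopwords that are removed: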
[1] "the" "of" "and" "a" "to" "in" "is" "you" "that" "it"
[11] "he" "was" "for" "on" "are" "as" "with" "his" "they" "I"
[21] "at" "be" "this" "have" "from"
[1] "SETUP COMPLETE"
[1] "REPLACEMENT COMPLETE"
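The listings that follow come from a back-off search: the cleaned string is shortened from the left, one word at a time, and the next words are tabulated at every length that occurs in the corpus. A minimal base R sketch of that idea on a toy corpus is given here. The corpus and the names `toy_lines`, `following_words`, and `back_off` are hypothetical, and the sketch matches whole words, whereas the Appendix code searches each line for the string as a substring.

```r
# Toy corpus standing in for the cleaned blog, news, and tweet lines (hypothetical).
toy_lines <- c("she would like tea", "he would not go", "we would like more")

# Hypothetical helper: collect the words that directly follow `phrase` in any line.
following_words <- function(phrase, lines) {
  n <- length(phrase)
  found <- character(0)
  for (line in lines) {
    tokens <- strsplit(line, " ")[[1]]
    last_start <- length(tokens) - n
    if (last_start < 1) next
    for (i in seq_len(last_start)) {
      if (identical(tokens[i:(i + n - 1)], phrase)) {
        found <- c(found, tokens[i + n])
      }
    }
  }
  found
}

# Back off: tabulate next words, drop the leftmost word, and repeat until empty.
back_off <- function(phrase, lines) {
  tables <- list()
  while (length(phrase) > 0) {
    hits <- following_words(phrase, lines)
    if (length(hits) > 0) {
      tables[[paste(phrase, collapse = " ")]] <- sort(table(hits), decreasing = TRUE)
    }
    phrase <- phrase[-1]
  }
  if (length(tables) == 0) "NO MATCHES" else tables
}

result <- back_off(c("live", "would"), toy_lines)
print(result)
```

In this toy run "live would" never occurs, so the only table returned is the one for "would", which mirrors how the longer strings in the listings print nothing until a short enough suffix is found.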
when breathe want air will there would live would ...
breathe want air will there would live would ...
want air will there would live would ...
air will there would live would ...
will there would live would ...
there would live would ...
would live would ...
Possible Next Word Frequency
1 progress 1
live would ...
Possible Next Word Frequency
1 not 2
2 able 1
3 been 1
4 buy 1
5 change 1
6 congratulate 1
7 drooling 1
8 exclusive 1
9 exhibited 1
10 give 1
would ...
Possible Next Word Frequency
1 like 4822
2 not 4688
3 been 2230
4 love 1693
5 never 1276
6 say 1273
7 do 1178
8 make 1168
9 take 858
10 go 797
guy my tables wife got up go bathroom asked about dessert started telling me about ...
my tables wife got up go bathroom asked about dessert started telling me about ...
tables wife got up go bathroom asked about dessert started telling me about ...
wife got up go bathroom asked about dessert started telling me about ...
got up go bathroom asked about dessert started telling me about ...
up go bathroom asked about dessert started telling me about ...
go bathroom asked about dessert started telling me about ...
bathroom asked about dessert started telling me about ...
asked about dessert started telling me about ...
about dessert started telling me about ...
dessert started telling me about ...
started telling me about ...
Possible Next Word Frequency
1 her 2
2 first 1
3 utility 1
4 visit 1
telling me about ...
Possible Next Word Frequency
1 her 6
2 how 4
3 one 3
4 my 2
5 she 2
6 some 2
7 them 2
8 your 2
9 all 1
10 being 1
me about ...
Possible Next Word Frequency
1 how 56
2 my 54
3 her 38
4 your 36
5 what 34
6 when 28
7 because 22
8 an 19
9 their 19
10 but 18
about ...
Possible Next Word Frequency
1 how 2709
2 my 2284
3 what 2196
4 her 1131
5 their 980
6 them 882
7 your 876
8 our 820
9 all 807
10 me 725
would give anything see arctic monkeys ...
give anything see arctic monkeys ...
anything see arctic monkeys ...
see arctic monkeys ...
arctic monkeys ...
Possible Next Word Frequency
1 bass 1
2 blink 1
3 classic 1
4 hopefully 1
5 humbug 1
6 immediately 1
7 really 1
monkeys ...
Possible Next Word Frequency
1 or 5
2 were 4
3 who 4
4 an 3
5 but 3
6 one 3
7 really 3
8 their 3
9 when 3
10 am 2
talking your mom has same effect hug helps reduce your ...
your mom has same effect hug helps reduce your ...
mom has same effect hug helps reduce your ...
has same effect hug helps reduce your ...
same effect hug helps reduce your ...
effect hug helps reduce your ...
hug helps reduce your ...
helps reduce your ...
reduce your ...
Possible Next Word Frequency
1 risk 11
2 debt 3
3 electricity 2
4 needs 2
5 time 2
6 anxiety 1
7 budget 1
8 chance 1
9 chances 1
10 costs 1
your ...
Possible Next Word Frequency
1 own 2504
2 life 1485
3 favorite 856
4 heart 662
5 mind 538
6 blog 536
7 eyes 519
8 head 491
9 child 490
10 body 467
when were holland were like 1 inch away me but hadnt time take ...
were holland were like 1 inch away me but hadnt time take ...
holland were like 1 inch away me but hadnt time take ...
were like 1 inch away me but hadnt time take ...
like 1 inch away me but hadnt time take ...
1 inch away me but hadnt time take ...
inch away me but hadnt time take ...
away me but hadnt time take ...
me but hadnt time take ...
but hadnt time take ...
hadnt time take ...
time take ...
Possible Next Word Frequency
1 care 18
2 look 12
3 break 7
4 me 7
5 action 6
6 some 6
7 up 6
8 your 6
9 get 5
10 back 4
take ...
Possible Next Word Frequency
1 care 1287
2 place 1281
3 me 1028
4 my 931
5 time 842
6 look 734
7 over 716
8 up 711
9 out 698
10 some 672
would just like all these questions answered presentation evidence jury settle ...
just like all these questions answered presentation evidence jury settle ...
like all these questions answered presentation evidence jury settle ...
all these questions answered presentation evidence jury settle ...
these questions answered presentation evidence jury settle ...
questions answered presentation evidence jury settle ...
answered presentation evidence jury settle ...
presentation evidence jury settle ...
evidence jury settle ...
jury settle ...
settle ...
Possible Next Word Frequency
1 down 282
2 into 192
3 back 42
4 my 35
5 by 33
6 her 23
7 some 19
8 our 18
9 but 17
10 one 16
can not deal unsymetrical things can not even hold an uneven number bags groceries each ...
not deal unsymetrical things can not even hold an uneven number bags groceries each ...
deal unsymetrical things can not even hold an uneven number bags groceries each ...
unsymetrical things can not even hold an uneven number bags groceries each ...
things can not even hold an uneven number bags groceries each ...
can not even hold an uneven number bags groceries each ...
not even hold an uneven number bags groceries each ...
even hold an uneven number bags groceries each ...
hold an uneven number bags groceries each ...
an uneven number bags groceries each ...
uneven number bags groceries each ...
number bags groceries each ...
bags groceries each ...
groceries each ...
each ...
Possible Next Word Frequency
1 other 4242
2 day 1037
3 one 706
4 time 632
5 out 622
6 every 575
7 year 553
8 us 541
9 them 466
10 my 443
every inch perfect bottom ...
inch perfect bottom ...
perfect bottom ...
bottom ...
Possible Next Word Frequency
1 line 331
2 my 125
3 post 98
4 pan 73
5 page 56
6 right 56
7 left 53
8 your 46
9 up 44
10 each 40
am thankful my childhood filled imagination bruises playing ...
thankful my childhood filled imagination bruises playing ...
my childhood filled imagination bruises playing ...
childhood filled imagination bruises playing ...
filled imagination bruises playing ...
imagination bruises playing ...
bruises playing ...
playing ...
Possible Next Word Frequency
1 around 137
2 my 125
3 game 103
4 along 102
5 games 94
6 field 88
7 her 82
8 their 66
9 music 61
10 some 60
like how same people amst all adam sandlers ...
how same people amst all adam sandlers ...
same people amst all adam sandlers ...
people amst all adam sandlers ...
amst all adam sandlers ...
all adam sandlers ...
adam sandlers ...
sandlers ...
[1] "NO MATCHES"
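Although the search runs faster, the discrepancy between these answers and those of Sentence Fragments and NLP II suggests that removing stopwords is a bad idea for next-word prediction.
A good example is “take care” and “take a look”: once “a” is removed, “take a look” collapses to “take look”, so “look” is counted as a word that directly follows “take”, right next to “care”.
Stopwords are an essential part of many common phrases.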
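The effect can be reproduced on a toy corpus. The sketch below is an illustration only: the four sentences, the shortened stopword list, and the helper names `next_word_counts` and `strip_stop_words` are made up for the example and are not part of the analysis code.

```r
# Hypothetical four-line corpus and a short subset of the stopword list.
corpus <- c("take care of yourself",
            "take a look at this",
            "please take a look",
            "they take care")
stop_words <- c("a", "of", "at", "this", "they")

# Count the words that directly follow `word` across all lines.
next_word_counts <- function(lines, word) {
  following <- unlist(lapply(strsplit(lines, " "), function(tokens) {
    hits <- which(tokens == word)
    hits <- hits[hits < length(tokens)]
    tokens[hits + 1]
  }))
  sort(table(following), decreasing = TRUE)
}

# Remove the stopwords from every line and rebuild the line.
strip_stop_words <- function(lines) {
  vapply(strsplit(lines, " "), function(tokens) {
    paste(tokens[!(tokens %in% stop_words)], collapse = " ")
  }, character(1))
}

with_stop <- next_word_counts(corpus, "take")
without_stop <- next_word_counts(strip_stop_words(corpus), "take")

print(with_stop)     # "a" and "care" follow "take"
print(without_stop)  # "care" and "look" follow "take"; "a" has disappeared
```

With the stopwords kept, a predictor can offer "a" after "take" and then "look" after "take a". With them removed, it can only offer "look" directly after "take", which is not how the phrase is written.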
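#Packages inferred from the functions called below.
library(readr)     #read_lines
library(stringr)   #str_locate, str_replace_all
library(tm)        #removePunctuation
library(qdap)      #rm_stopwords
library(textclean) #replace_contraction (loaded after qdap so this version is used)
library(dplyr)     #%>%
#Files are read. Non-ASCII characters are replaced with byte escapes such as <e2>.
blogs <- read_lines(file = "en_US.blogs.txt")
blogs <- iconv(blogs, "UTF-8", "ASCII", sub = "byte")
news <- read_lines(file = "en_US.news.txt")
news <- iconv(news, "UTF-8", "ASCII", sub = "byte")
tweets <- read_lines(file = "en_US.twitter.txt")
tweets <- iconv(tweets, "UTF-8", "ASCII", sub = "byte")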
#Text Cleanup Function:
clean_up <- function(data) {
  #"’" is replaced with "'", contractions are expanded.
  temp <- replace_contraction(gsub("’", "'", data),
                              contraction.key = lexicon::key_contractions)
  #"U.S." is replaced with "US" (fixed = TRUE so that "." is matched literally).
  temp <- gsub("U.S.", "US", temp, fixed = TRUE)
  #"a.m." is replaced with "am".
  temp <- gsub("a.m.", "am", temp, fixed = TRUE)
  #"p.m." is replaced with "pm".
  temp <- gsub("p.m.", "pm", temp, fixed = TRUE)
  #Remove punctuation.
  temp <- removePunctuation(temp)
  #Remove stopwords. The result is a list with one vector of words per input string.
  temp <- rm_stopwords(temp)
  return(temp)
}
#Function returns the word immediately after a string for all lines that contain the string.
retrieve_next_words <- function(input_string_split) {
  #Word length of input string.
  n <- length(input_string_split)
  #Recombine into string.
  input_string <- paste(input_string_split, collapse = " ")
  #Print string and whitespace.
  cat(paste('\n\n', input_string, '...\n'))
  #Final list.
  next_words <- NULL
  #bnts is assumed to hold the cleaned blog, news, and tweet lines; its construction is not shown in this listing.
  for (bnt in bnts) {
    #Check if the string is in the line (literal match, not a regular expression).
    if (grepl(input_string, bnt, fixed = TRUE)) {
      #Find where in the line it is located.
      beginning <- str_locate(bnt, fixed(input_string))[1]
      #Start from where the string is, split by space, grab the next word.
      next_word <- strsplit(substring(bnt, beginning), " ")[[1]][n + 1]
      #Get rid of non-alphanumeric.
      next_word <- str_replace_all(next_word, regex("\\W+"), "")
      #Update final list.
      next_words <- c(next_words, next_word)
    }
  }
  return(next_words)
}
possibilities <- function(q_string) {
  #Cleanup.
  q_string_split <- clean_up(q_string)[[1]]
  #Word length of input string.
  n <- length(q_string_split)
  tables_displayed <- 0
  while (n > 0) {
    #Get next words from blogs, news, and tweets.
    q_words <- retrieve_next_words(q_string_split)
    #If there is at least one word...
    if (length(q_words) > 0) {
      #Update table count.
      tables_displayed <- tables_displayed + 1
      #Print at most 10 entries of the frequency table.
      df_freq <- df_pred_creator(q_words)
      print.data.frame(head(df_freq, 10))
    }
    #Whether it worked or not, try again with a shortened string.
    #Shorten string by removing left most word and update word count.
    q_string_split <- q_string_split[-1]
    n <- n - 1
  }
  #All string lengths failed to match.
  if (tables_displayed == 0) {
    return("NO MATCHES")
  }
}
#This function will create a Frequency dataframe from next_words.
df_pred_creator <- function(next_words) {
  #Convert sorted table to dataframe.
  df <- as.data.frame(sort(table(next_words), decreasing = TRUE))
  #If there is only one column, rebuild two columns from the row names and the single column.
  if (length(df) == 1) {
    df <- cbind(newColName = rownames(df), df)
    rownames(df) <- 1:nrow(df)
  }
  #Change Column Names.
  df <- df %>%
    `colnames<-`(c("Possible Next Word", "Frequency"))
  return(df)
}
print("SETUP COMPLETE")