I am now using the lines from all three documents (blogs, news, and tweets) to predict the next word for a quiz.
For each quiz string, the program repeatedly shortens the string from the left and, wherever the remaining string matches lines in the corpus, prints a frequency table of the possible next words.
This approach will eventually be used to predict the next word in a random statement.
See the Appendix for code.
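The output below was presumably produced by calling possibilities() from the Appendix on each quiz string, for example:

possibilities("you, I would live and I would")
possibilities("give anything to see arctic monkeys this")

(The function keeps at most the last seven words of a longer input, which is why the first line printed for each quiz item is a seven-word suffix.)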
[1] "SETUP COMPLETE"
[1] "REPLACEMENT COMPLETE"
you, I would live and I would ...
I would live and I would ...
would live and I would ...
live and I would ...
Possible Next Word Frequency
1 be 1
and I would ...
Possible Next Word Frequency
1 like 217
2 not 192
3 be 184
4 love 156
5 have 144
6 say 48
7 just 38
8 go 37
9 never 36
10 rather 34
I would ...
Possible Next Word Frequency
1 like 5801
2 have 4667
3 be 4589
4 not 4300
5 love 3995
6 say 2048
7 rather 1722
8 never 1368
9 do 1006
10 go 645
and he started telling me about his ...
Possible Next Word Frequency
1 visit 1
he started telling me about his ...
Possible Next Word Frequency
1 visit 1
started telling me about his ...
Possible Next Word Frequency
1 visit 1
telling me about his ...
Possible Next Word Frequency
1 daughters 1
2 dream 1
3 favorite 1
4 one 1
5 past 1
6 soccer 1
7 struggles 1
8 trip 1
9 visit 1
me about his ...
Possible Next Word Frequency
1 day 3
2 new 3
3 past 2
4 2002 1
5 addiction 1
6 amazon 1
7 and 1
8 behavior 1
9 birthday 1
10 breaking 1
give anything to see arctic monkeys this ...
anything to see arctic monkeys this ...
to see arctic monkeys this ...
see arctic monkeys this ...
arctic monkeys this ...
monkeys this ...
this ...
Possible Next Word Frequency
1 is 23494
2 year 18404
3 week 15118
4 morning 10402
5 weekend 9197
6 time 7463
7 one 6318
8 season 5551
9 was 4948
10 month 4101
as a hug and helps reduce your ...
a hug and helps reduce your ...
hug and helps reduce your ...
and helps reduce your ...
helps reduce your ...
reduce your ...
Possible Next Word Frequency
1 risk 19
2 exposure 5
3 costs 4
4 stress 4
5 debt 3
6 energy 3
7 carbon 2
8 chance 2
9 chances 2
10 credit 2
your ...
Possible Next Word Frequency
1 own 5198
2 life 4222
3 favorite 4035
4 friends 1966
5 day 1885
6 heart 1816
7 way 1591
8 head 1536
9 name 1527
10 mind 1525
but you hadn't time to take a ...
you hadn't time to take a ...
hadn't time to take a ...
time to take a ...
Possible Next Word Frequency
1 look 14
2 break 10
3 nap 9
4 of 6
5 stand 4
6 closer 3
7 deep 3
8 few 3
9 long 3
10 picture 3
to take a ...
Possible Next Word Frequency
1 of 672
2 look 265
3 picture 209
4 break 203
5 nap 164
6 few 126
7 to 117
8 shower 109
9 the 90
10 step 83
take a ...
Possible Next Word Frequency
1 of 1340
2 look 1124
3 picture 461
4 nap 435
5 break 368
6 few 366
7 to 360
8 the 320
9 from 242
10 shower 241
evidence, and a jury to settle the ...
and a jury to settle the ...
a jury to settle the ...
jury to settle the ...
to settle the ...
Possible Next Word Frequency
1 case 18
2 issue 6
3 differences 5
4 dispute 5
5 matter 5
6 lawsuit 4
7 score 4
8 suit 3
9 cases 2
10 charges 2
settle the ...
Possible Next Word Frequency
1 case 21
2 issue 12
3 differences 9
4 matter 7
5 dispute 6
6 bill 5
7 score 5
8 lawsuit 4
9 question 4
10 clam 3
number of bags of groceries in each ...
of bags of groceries in each ...
bags of groceries in each ...
Possible Next Word Frequency
1 hand 1
of groceries in each ...
Possible Next Word Frequency
1 hand 1
groceries in each ...
Possible Next Word Frequency
1 hand 1
in each ...
Possible Next Word Frequency
1 of 553
2 others 107
3 other 86
4 direction 60
5 category 52
6 state 37
7 case 36
8 and 35
9 sc 33
10 city 32
is perfect from the bottom to the ...
perfect from the bottom to the ...
from the bottom to the ...
Possible Next Word Frequency
1 top 7
the bottom to the ...
Possible Next Word Frequency
1 top 10
bottom to the ...
Possible Next Word Frequency
1 top 11
2 beginning 1
3 St 1
to the ...
Possible Next Word Frequency
1 next 1879
2 public 1560
3 point 1543
4 world 1488
5 US 1327
6 top 1280
7 new 1211
8 city 1098
9 end 1058
10 game 1056
filled with imagination and bruises from playing ...
with imagination and bruises from playing ...
imagination and bruises from playing ...
and bruises from playing ...
bruises from playing ...
from playing ...
Possible Next Word Frequency
1 the 18
2 in 17
3 with 13
4 at 7
5 basketball 7
6 for 6
7 a 4
8 on 4
9 though 4
10 and 3
playing ...
Possible Next Word Frequency
1 with 1684
2 the 1553
3 in 1535
4 a 1074
5 at 744
6 for 629
7 on 618
8 and 372
9 time 353
10 field 300
are in almost all of Adam Sandler's ...
in almost all of Adam Sandler's ...
almost all of Adam Sandler's ...
all of Adam Sandler's ...
of Adam Sandler's ...
Adam Sandler's ...
Possible Next Word Frequency
1 Jack 2
2 50 1
3 Bedtime 1
4 body 1
5 films 1
6 grandma 1
7 hisher 1
8 Lunchlady 1
9 recent 1
10 résumé 1
Sandler's ...
Possible Next Word Frequency
1 character 2
2 Jack 2
3 50 1
4 beds 1
5 Bedtime 1
6 body 1
7 brand 1
8 deal 1
9 double 1
10 films 1
For the ten inputs and the corpus of blogs, news, and tweets:
Accurate prediction is extremely complex; the correct next word may depend on a reference word much earlier in the statement.
Also, proper nouns like "arctic monkeys" in input 3 and "Adam Sandler" in input 10 are so specific and rare that they are hard to match accurately.
#Libraries used by the Appendix code.
library(textclean) #replace_contraction
library(lexicon)   #key_contractions
library(stringr)
library(dplyr)

#Function:
#"’" is replaced with "'", then contractions are expanded.
#"U.S." is replaced with "US".
#"a.m." is replaced with "am".
#"p.m." is replaced with "pm".
replacement <- function(data) {
  temp <- replace_contraction(gsub("’", "'", data),
                              contraction.key = lexicon::key_contractions)
  #fixed = TRUE so "." is matched literally rather than as a regex wildcard.
  temp <- gsub("U.S.", "US", temp, fixed = TRUE)
  temp <- gsub("a.m.", "am", temp, fixed = TRUE)
  temp <- gsub("p.m.", "pm", temp, fixed = TRUE)
  return(temp)
}
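#Why fixed = TRUE matters: as a regex, "U.S." means 'U', any character,
#'S', any character, so it can also rewrite unrelated text. A quick
#illustrative check (not part of the pipeline):
gsub("U.S.", "US", "reUSSR.")               #regex match: "reUS."
gsub("U.S.", "US", "reUSSR.", fixed = TRUE) #literal match: unchanged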
#Function returns the word immediately after a string for all lines that contain the string.
retrieve_next_words <- function(input_string) {
  #Word length of input string.
  n <- length(strsplit(input_string, " ")[[1]])
  #Print the string plus whitespace.
  cat(paste('\n\n', input_string, '...\n'))
  #Final list.
  next_words <- NULL
  for (bnt in bnts) {
    #Check if the string is in the line (literal match, not regex).
    if (grepl(input_string, bnt, fixed = TRUE)) {
      #Find where in the line it is located.
      beginning <- str_locate(bnt, fixed(input_string))[1]
      #Start from where the string is, split by space, grab the next word.
      next_word <- strsplit(substring(bnt, beginning), " ")[[1]][n+1]
      #Get rid of non-alphanumeric characters.
      next_word <- str_replace_all(next_word, regex("\\W+"), "")
      #Update final list.
      next_words <- c(next_words, next_word)
    }
  }
  return(next_words)
}
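#The core extraction step, shown on a single hypothetical line
#(the real corpus lines live in `bnts`):
line <- "he said I would never go"
beginning <- str_locate(line, fixed("I would"))[1]    #9
strsplit(substring(line, beginning), " ")[[1]][2 + 1] #"never" (n = 2 for "I would")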
possibilities <- function(q_string) {
  #Cleanup.
  q_string <- replacement(q_string)
  #Word length of input string.
  temp_str <- str_split(q_string, " ")[[1]]
  n <- length(temp_str)
  #Start with the last 7 words.
  if (n > 7) {
    q_string <- paste(temp_str[(n-6):n], collapse = ' ')
  }
  tables_displayed <- 0
  while (nchar(q_string) > 0) {
    #Get next words from blogs, news, and tweets.
    q_words <- retrieve_next_words(q_string)
    #If there is at least one word...
    if (length(q_words) > 0) {
      #Update table count.
      tables_displayed <- tables_displayed + 1
      #Print at most 10 entries of the frequency table.
      df_freq <- df_pred_creator(q_words)
      if (dim(df_freq)[1] > 10) {
        print.data.frame(df_freq[1:10,])
      } else {
        print.data.frame(df_freq)
      }
    }
    #Whether it matched or not, try again with a shortened string.
    #Shorten the string by removing the left-most word.
    q_string <- substring(q_string, str_locate(q_string, " ")[1]+1)
    #Update word count.
    if (is.na(q_string)) {
      break
    } else {
      temp_str <- str_split(q_string, " ")[[1]]
      n <- length(temp_str)
      #If there are at least two tables and only one word left, stop.
      if ((tables_displayed > 1) & (n == 1)) {
        break
      }
    }
    if (tables_displayed == 5) {
      break
    }
  }
  #All string lengths failed to match.
  if (tables_displayed == 0) {
    return("NO MATCHES")
  }
}
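#The back-off step in isolation: each pass drops the left-most word
#(a small illustration, not part of the pipeline):
q <- "to take a"
substring(q, str_locate(q, " ")[1] + 1) #"take a"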
#This function creates a frequency dataframe from next_words.
df_pred_creator <- function(next_words) {
  #Convert sorted table to dataframe.
  df <- as.data.frame(sort(table(next_words), decreasing = TRUE))
  #If there is only one column, rebuild two columns from the row names and the single column.
  if (length(df) == 1) {
    df <- cbind(newColName = rownames(df), df)
    rownames(df) <- 1:nrow(df)
  }
  #Change column names.
  df <- df %>%
    `colnames<-`(c("Possible Next Word", "Frequency"))
  return(df)
}
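#For example, a vector of retrieved words becomes a ranked frequency table:
df_pred_creator(c("look", "break", "look", "nap", "look"))
#  Possible Next Word Frequency
#1               look         3
#2              break         1
#3                nap         1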
print("SETUP COMPLETE")