I. Background

I am now using the lines from all three documents (blogs, news, and tweets) to predict the next word for each of ten quiz sentence fragments.

For each string:

It is first sliced to its last (rightmost) 7 words.
The sliced string is compared to all lines from the blogs, news, and tweets.
If there are matches, the next words are saved and output as a frequency table.
Whether or not there were matches, the leftmost word is sliced off the string and the steps are repeated.
At most, the top ten words are output in each frequency table.
At most, five frequency tables are output per initial string.
If at least two frequency tables have been output, the algorithm stops once only one word is left in the string. Two of the initial strings end with “the” and “a”, and creating the frequency table for that one-word segment alone was taking 8+ hours.
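
The slicing steps above can be sketched in isolation. This is a minimal illustration in base R, not the code used for this report (that is in the Appendix); the function name backoff_phrases and its max_words argument are hypothetical. It only builds the sequence of search phrases and does not look anything up in the corpus.

```r
# Build the sequence of search phrases for one input string:
# the last `max_words` words first, then the same phrase with the
# leftmost word removed, and so on down to a single word.
# (backoff_phrases is a hypothetical helper, not part of the report's code.)
backoff_phrases <- function(input, max_words = 7) {
  words <- strsplit(input, " ", fixed = TRUE)[[1]]
  words <- utils::tail(words, max_words)
  vapply(seq_along(words),
         function(i) paste(words[i:length(words)], collapse = " "),
         character(1))
}

phrases <- backoff_phrases("Every inch of you is perfect from the bottom to the")
print(phrases)
```

The first phrase is "is perfect from the bottom to the" and the last is "the", which is the order in which the phrases appear in the Predictions output.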

This data will eventually be used to predict the next word in an arbitrary statement.

See the Appendix for code.

[1] "SETUP COMPLETE"

II. Predictions

[1] "REPLACEMENT COMPLETE"

1. “When you breathe, I want to be the air for you. I’ll be there for you, I’d live and I’d…”



 you, I would live and I would ...


 I would live and I would ...


 would live and I would ...


 live and I would ...
  Possible Next Word Frequency
1                 be         1


 and I would ...
   Possible Next Word Frequency
1                like       217
2                 not       192
3                  be       184
4                love       156
5                have       144
6                 say        48
7                just        38
8                  go        37
9               never        36
10             rather        34


 I would ...
   Possible Next Word Frequency
1                like      5801
2                have      4667
3                  be      4589
4                 not      4300
5                love      3995
6                 say      2048
7              rather      1722
8               never      1368
9                  do      1006
10                 go       645

2. “Guy at my table’s wife got up to go to the bathroom and I asked about dessert and he started telling me about his…”



 and he started telling me about his ...
  Possible Next Word Frequency
1              visit         1


 he started telling me about his ...
  Possible Next Word Frequency
1              visit         1


 started telling me about his ...
  Possible Next Word Frequency
1              visit         1


 telling me about his ...
  Possible Next Word Frequency
1          daughters         1
2              dream         1
3           favorite         1
4                one         1
5               past         1
6             soccer         1
7          struggles         1
8               trip         1
9              visit         1


 me about his ...
   Possible Next Word Frequency
1                 day         3
2                 new         3
3                past         2
4                2002         1
5           addiction         1
6              amazon         1
7                 and         1
8            behavior         1
9            birthday         1
10           breaking         1

3. “I’d give anything to see arctic monkeys this…”



 give anything to see arctic monkeys this ...


 anything to see arctic monkeys this ...


 to see arctic monkeys this ...


 see arctic monkeys this ...


 arctic monkeys this ...


 monkeys this ...


 this ...
   Possible Next Word Frequency
1                  is     23494
2                year     18404
3                week     15118
4             morning     10402
5             weekend      9197
6                time      7463
7                 one      6318
8              season      5551
9                 was      4948
10              month      4101

4. “Talking to your mom has the same effect as a hug and helps reduce your…”



 as a hug and helps reduce your ...


 a hug and helps reduce your ...


 hug and helps reduce your ...


 and helps reduce your ...


 helps reduce your ...


 reduce your ...
   Possible Next Word Frequency
1                risk        19
2            exposure         5
3               costs         4
4              stress         4
5                debt         3
6              energy         3
7              carbon         2
8              chance         2
9             chances         2
10             credit         2


 your ...
   Possible Next Word Frequency
1                 own      5198
2                life      4222
3            favorite      4035
4             friends      1966
5                 day      1885
6               heart      1816
7                 way      1591
8                head      1536
9                name      1527
10               mind      1525

5. “When you were in Holland you were like 1 inch away from me but you hadn’t time to take a…”



 but you hadn't time to take a ...


 you hadn't time to take a ...


 hadn't time to take a ...


 time to take a ...
   Possible Next Word Frequency
1                look        14
2               break        10
3                 nap         9
4                  of         6
5               stand         4
6              closer         3
7                deep         3
8                 few         3
9                long         3
10            picture         3


 to take a ...
   Possible Next Word Frequency
1                  of       672
2                look       265
3             picture       209
4               break       203
5                 nap       164
6                 few       126
7                  to       117
8              shower       109
9                 the        90
10               step        83


 take a ...
   Possible Next Word Frequency
1                  of      1340
2                look      1124
3             picture       461
4                 nap       435
5               break       368
6                 few       366
7                  to       360
8                 the       320
9                from       242
10             shower       241

6. “I’d just like all of these questions answered, a presentation of evidence, and a jury to settle the…”



 evidence, and a jury to settle the ...


 and a jury to settle the ...


 a jury to settle the ...


 jury to settle the ...


 to settle the ...
   Possible Next Word Frequency
1                case        18
2               issue         6
3         differences         5
4             dispute         5
5              matter         5
6             lawsuit         4
7               score         4
8                suit         3
9               cases         2
10            charges         2


 settle the ...
   Possible Next Word Frequency
1                case        21
2               issue        12
3         differences         9
4              matter         7
5             dispute         6
6                bill         5
7               score         5
8             lawsuit         4
9            question         4
10               clam         3

7. “I can’t deal with unsymetrical things. I can’t even hold an uneven number of bags of groceries in each…”



 number of bags of groceries in each ...


 of bags of groceries in each ...


 bags of groceries in each ...
  Possible Next Word Frequency
1               hand         1


 of groceries in each ...
  Possible Next Word Frequency
1               hand         1


 groceries in each ...
  Possible Next Word Frequency
1               hand         1


 in each ...
   Possible Next Word Frequency
1                  of       553
2              others       107
3               other        86
4           direction        60
5            category        52
6               state        37
7                case        36
8                 and        35
9                  sc        33
10               city        32

8. “Every inch of you is perfect from the bottom to the…”



 is perfect from the bottom to the ...


 perfect from the bottom to the ...


 from the bottom to the ...
  Possible Next Word Frequency
1                top         7


 the bottom to the ...
  Possible Next Word Frequency
1                top        10


 bottom to the ...
  Possible Next Word Frequency
1                top        11
2          beginning         1
3                 St         1


 to the ...
   Possible Next Word Frequency
1                next      1879
2              public      1560
3               point      1543
4               world      1488
5                  US      1327
6                 top      1280
7                 new      1211
8                city      1098
9                 end      1058
10               game      1056

9. “I’m thankful my childhood was filled with imagination and bruises from playing…”



 filled with imagination and bruises from playing ...


 with imagination and bruises from playing ...


 imagination and bruises from playing ...


 and bruises from playing ...


 bruises from playing ...


 from playing ...
   Possible Next Word Frequency
1                 the        18
2                  in        17
3                with        13
4                  at         7
5          basketball         7
6                 for         6
7                   a         4
8                  on         4
9              though         4
10                and         3


 playing ...
   Possible Next Word Frequency
1                with      1684
2                 the      1553
3                  in      1535
4                   a      1074
5                  at       744
6                 for       629
7                  on       618
8                 and       372
9                time       353
10              field       300

10. “I like how the same people are in almost all of Adam Sandler’s…”



 are in amst all of Adam Sandler's ...


 in amst all of Adam Sandler's ...


 amst all of Adam Sandler's ...


 all of Adam Sandler's ...


 of Adam Sandler's ...


 Adam Sandler's ...
   Possible Next Word Frequency
1                Jack         2
2                  50         1
3             Bedtime         1
4                body         1
5               films         1
6             grandma         1
7              hisher         1
8           Lunchlady         1
9              recent         1
10             résumé         1


 Sandler's ...
   Possible Next Word Frequency
1           character         2
2                Jack         2
3                  50         1
4                beds         1
5             Bedtime         1
6                body         1
7               brand         1
8                deal         1
9              double         1
10              films         1

III. Summary

For the ten inputs and the corpus of blogs, news, and tweets, the longest match for each input was as follows:

2. matched for seven words.
7. and 8. matched for five words.
1. and 5. matched for four words.
6. matched for three words.
4., 9., and 10. matched for two words.
3. matched for only one word.

Accurate prediction is extremely complex, and the correct word may depend on a reference word much earlier in the statement.

Also, proper nouns like "arctic monkeys" in 3. and "Adam Sandler" in 10. are extremely specific and rarely match.

Appendix

Setup

knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)
options(java.parameters = "-Xmx8g")

Packages

library(dplyr)
library(readr)
library(RWeka)
library(stringr)
library(textclean)
library(tm)

Read Files

#Files are read.

blogs <- read_lines(file = "en_US.blogs.txt")

news <- read_lines(file = "en_US.news.txt")

tweets <- read_lines(file = "en_US.twitter.txt")

Helper Functions

#Function:
#"’" is replaced with "'", contractions are expanded.
#"U.S." is replaced with "US".
#"a.m." is replaced with "am".
#"p.m." is replaced with "pm".
#fixed = TRUE matches each pattern literally; as a regular expression,
#"a.m." would also match inside words such as "almost".
replacement <- function(data) {
  
  temp <- replace_contraction(gsub("’", "'", data),
                              contraction.key = lexicon::key_contractions)
  temp <- gsub("U.S.", "US", temp, fixed = TRUE)
  temp <- gsub("a.m.", "am", temp, fixed = TRUE)
  temp <- gsub("p.m.", "pm", temp, fixed = TRUE)
  return(temp)
}
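
In base R, gsub treats its pattern as a regular expression unless fixed = TRUE is given, and in a regular expression a dot matches any character. The short sketch below (example strings are my own) shows what that means for a dotted pattern such as "a.m.", and is how a word like "almost" can turn into "amst".

```r
# As a regular expression, "a.m." matches "a", any character, "m", any character,
# so it matches the "almo" in "almost".
regex_result <- gsub("a.m.", "am", "almost all")
print(regex_result)    # "amst all"

# With fixed = TRUE the dots are literal, so ordinary words are left alone...
fixed_result <- gsub("a.m.", "am", "almost all", fixed = TRUE)
print(fixed_result)    # "almost all"

# ...and only the literal abbreviation is replaced.
abbrev_result <- gsub("a.m.", "am", "at 9 a.m. sharp", fixed = TRUE)
print(abbrev_result)   # "at 9 am sharp"
```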



#Function returns the word immediately after a string for all lines that contain the string.
retrieve_next_words <- function(input_string) {

  #Word length of input string.
  n <- length(strsplit(input_string, " ")[[1]])
  
  #Print string and whitespace.
  cat(paste('\n\n', input_string, '...\n'))

  #Final list.
  next_words <- NULL
  
  for (bnt in bnts) {
    #Check if the string is in the line.
    if(grepl(input_string, bnt)) {
      #Find where in the line it is located.
      beginning <- str_locate(bnt, input_string)[1]
      #Start from where the string is, split by space, grab the next word.
      next_word <- strsplit(substring(bnt, beginning), " ")[[1]][n+1]
      #Get rid of non-alphanumeric.
      next_word <- str_replace_all(next_word, regex("\\W+"), "")
      #Update final list.
      next_words <- c(next_words, next_word)
    }
  }
  return(next_words)
}
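
The loop above checks one line at a time, which is slow on a corpus of this size. A possible alternative, sketched below in base R, searches all lines in one vectorized call. This is an assumption-laden sketch, not the code that produced the results in this report: the name next_words_vectorized and the toy lines are hypothetical, and the phrase must be surrounded by spaces, so it is stricter than the substring match used above (for example, "about his" does not match "about history", and a phrase followed directly by punctuation is missed).

```r
# Return the word following `phrase` in every line that contains it as whole words.
# (next_words_vectorized is a hypothetical helper; `lines` is a character vector.)
next_words_vectorized <- function(phrase, lines) {
  # Pad with spaces so the phrase only matches on word boundaries.
  padded <- paste0(" ", lines, " ")
  key <- paste0(" ", phrase, " ")
  # One literal search over the whole vector; -1 means no match.
  pos <- regexpr(key, padded, fixed = TRUE)
  hit <- which(pos > 0)
  # Text after the phrase, cut at the next space.
  tails <- substring(padded[hit], pos[hit] + nchar(key))
  first <- sub(" .*$", "", tails)
  # Get rid of non-alphanumeric characters, then drop empty results.
  first <- gsub("\\W+", "", first)
  first[nzchar(first)]
}

toy_lines <- c("he told me about his trip",
               "ask me about history class",
               "nothing here",
               "all about his")
found <- next_words_vectorized("about his", toy_lines)
print(found)    # "trip"
```

Only the first line contributes a word: the second contains "his" only as part of "history", and in the last line nothing follows the phrase.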



possibilities <- function(q_string){
  
  #Cleanup.
  q_string <- replacement(q_string)
  #Word length of input string.
  temp_str <- str_split(q_string, " ")[[1]]
  n <- length(temp_str)
  
  #Start with last 7 words.
  if (n > 7) {
    q_string <- paste(temp_str[(n-6):n], collapse = ' ')    
  }

  tables_displayed <- 0
  while (nchar(q_string) > 0) {

    #Get next words from blogs, news, and tweets.
    q_words <- retrieve_next_words(q_string)
    
    #If there is at least one word...
    if (length(q_words) > 0) {
      
      #Update table count.
      tables_displayed <- tables_displayed + 1
      
      #Return at most 10 entries of the frequency table.

      df_freq <- df_pred_creator(q_words)
      
      if (dim(df_freq)[1] > 10){
        print.data.frame(df_freq[1:10,])
      } else {
        print.data.frame(df_freq)
      }
    }
    
    #Whether it worked or not, try again with a shortened string.
    #Shorten string by removing left most word.
    q_string <- substring(q_string, str_locate(q_string, " ")[1]+1)
    
    #Update word count.
    if (is.na(q_string)) {
      break
    } else {
      temp_str <- str_split(q_string, " ")[[1]]
      n <- length(temp_str)
      
      #If there are at least two tables and only one word left, break.
      if ((tables_displayed > 1) && (n == 1)) {
        break
      }
    }
    
    if (tables_displayed == 5) {
      break
    } 
  }
  
  #All string lengths failed to match.
  if (tables_displayed == 0) {
    return("NO MATCHES")
  }
}



#This function will create a Frequency dataframe from next_words.
df_pred_creator <- function(next_words) {

  #Convert sorted table to dataframe.
  df <- as.data.frame(sort(table(next_words), decreasing = TRUE))
  
  #If there is only one column, change to two columns from index and single column.
  if (length(df) == 1) {
    
    df <- cbind(newColName = rownames(df), df)
    rownames(df) <- 1:nrow(df)
  }
  
  #Change Column Names.
  df <- df %>%
    `colnames<-`(c("Possible Next Word", "Frequency"))
  
  return(df)
}
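
For comparison, the frequency table can also be built column by column from the names and counts of the sorted table. The sketch below is one possible variant, assuming at least one word is passed in; the name freq_table and its top argument are hypothetical. Because it reads names(freq) directly, it does not need a separate branch for the single-word case.

```r
# Build a two-column frequency table, most frequent word first.
# (freq_table is a hypothetical helper; assumes next_words is non-empty.)
freq_table <- function(next_words, top = 10) {
  freq <- sort(table(next_words), decreasing = TRUE)
  df <- data.frame(names(freq), as.integer(freq), stringsAsFactors = FALSE)
  colnames(df) <- c("Possible Next Word", "Frequency")
  # Keep at most `top` rows.
  utils::head(df, top)
}

example <- freq_table(c("top", "top", "beginning", "top", "St"))
print(example)

single <- freq_table("hand")
print(single)
```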

print("SETUP COMPLETE")

II. Predictions

blogs <- replacement(blogs)

news <- replacement(news)

tweets <- replacement(tweets)

bnts <- c(blogs, news, tweets)

print("REPLACEMENT COMPLETE")

1.

q1_string <- "When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd"

possibilities(q1_string)

2.

q2_string <- "Guy at my table's wife got up to go to the bathroom and I asked about dessert and he started telling me about his"

possibilities(q2_string)

3.

q3_string <- "I'd give anything to see arctic monkeys this"

possibilities(q3_string)

4.

q4_string <- "Talking to your mom has the same effect as a hug and helps reduce your"

possibilities(q4_string)

5.

q5_string <- "When you were in Holland you were like 1 inch away from me but you hadn't time to take a"

possibilities(q5_string)

6.

q6_string <- "I'd just like all of these questions answered, a presentation of evidence, and a jury to settle the"

possibilities(q6_string)

7.

q7_string <- "I can't deal with unsymetrical things. I can't even hold an uneven number of bags of groceries in each"

possibilities(q7_string)

8.

q8_string <- "Every inch of you is perfect from the bottom to the"

possibilities(q8_string)

9.

q9_string <- "I’m thankful my childhood was filled with imagination and bruises from playing"

possibilities(q9_string)

10.

q10_string <- "I like how the same people are in almost all of Adam Sandler's"

possibilities(q10_string)

Data Science Capstone

Sentence Fragments and NLP II

Rohan Lewis

2020.12.13
