I. Background

I am now removing stopwords from all parts of the analysis.

For each string:

Contractions are expanded and stopwords are removed.
The new string is compared to all lines from the blogs, news, and tweets, which also have expanded contractions and removed stopwords.
If there are matches, the next word(s) are saved and output as a frequency table.
Whether or not there are matches, the leftmost word is sliced off the string and the steps are repeated.
At most, the top ten words will be output in a frequency table.
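
The steps above can be sketched in base R on a toy corpus. This is a minimal illustration under stated assumptions, not the Appendix code: `toy_corpus`, `toy_stopwords`, `strip_stopwords()`, `next_words_after()` and `backoff_predict()` are hypothetical names, contraction expansion is omitted, and only three stopwords are used.

```r
# Toy stand-ins for the cleaned blogs/news/tweets lines (assumed data).
toy_stopwords <- c("the", "a", "to")
toy_corpus <- c("please take care of yourself",
                "we take care every day",
                "they take place tomorrow")

# Drop stopwords from a lower-cased, space-separated string.
strip_stopwords <- function(text) {
  words <- strsplit(tolower(text), " +")[[1]]
  words[!(words %in% toy_stopwords)]
}

# Collect the word that follows `phrase` in each corpus line containing it.
next_words_after <- function(phrase, corpus) {
  n <- length(strsplit(phrase, " ")[[1]])
  out <- character(0)
  for (line in corpus) {
    words <- strsplit(line, " ")[[1]]
    if (length(words) <= n) next
    for (i in seq_len(length(words) - n)) {
      if (paste(words[i:(i + n - 1)], collapse = " ") == phrase) {
        out <- c(out, words[i + n])
      }
    }
  }
  out
}

# Try the full fragment, then keep slicing off the leftmost word.
backoff_predict <- function(fragment, corpus, top = 10) {
  words <- strip_stopwords(fragment)
  results <- list()
  while (length(words) > 0) {
    phrase <- paste(words, collapse = " ")
    found <- next_words_after(phrase, corpus)
    if (length(found) > 0) {
      tab <- sort(table(found), decreasing = TRUE)
      # Keep at most `top` entries of the frequency table.
      results[[phrase]] <- tab[seq_len(min(top, length(tab)))]
    }
    words <- words[-1]
  }
  results
}

res <- backoff_predict("time to take the", toy_corpus)
print(res)
```

In this toy run the two-word phrase "time take" finds nothing, so only the one-word phrase "take" produces a table, which mirrors the steps without a table in the output of section II.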

This data will eventually be used to predict the next word in a random statement.

See the Appendix for code.

These are the stopwords removed:

 [1] "the"  "of"   "and"  "a"    "to"   "in"   "is"   "you"  "that" "it"  
[11] "he"   "was"  "for"  "on"   "are"  "as"   "with" "his"  "they" "I"   
[21] "at"   "be"   "this" "have" "from"

[1] "SETUP COMPLETE"

II. Predictions

[1] "REPLACEMENT COMPLETE"

1. “When you breathe, I want to be the air for you. I’ll be there for you, I’d live and I’d…”



 when breathe want air will there would live would ...


 breathe want air will there would live would ...


 want air will there would live would ...


 air will there would live would ...


 will there would live would ...


 there would live would ...


 would live would ...
  Possible Next Word Frequency
1           progress         1


 live would ...
   Possible Next Word Frequency
1                 not         2
2                able         1
3                been         1
4                 buy         1
5              change         1
6        congratulate         1
7            drooling         1
8           exclusive         1
9           exhibited         1
10               give         1


 would ...
   Possible Next Word Frequency
1                like      4822
2                 not      4688
3                been      2230
4                love      1693
5               never      1276
6                 say      1273
7                  do      1178
8                make      1168
9                take       858
10                 go       797

2. “Guy at my table’s wife got up to go to the bathroom and I asked about dessert and he started telling me about his…”



 guy my tables wife got up go bathroom asked about dessert started telling me about ...


 my tables wife got up go bathroom asked about dessert started telling me about ...


 tables wife got up go bathroom asked about dessert started telling me about ...


 wife got up go bathroom asked about dessert started telling me about ...


 got up go bathroom asked about dessert started telling me about ...


 up go bathroom asked about dessert started telling me about ...


 go bathroom asked about dessert started telling me about ...


 bathroom asked about dessert started telling me about ...


 asked about dessert started telling me about ...


 about dessert started telling me about ...


 dessert started telling me about ...


 started telling me about ...
  Possible Next Word Frequency
1                her         2
2              first         1
3            utility         1
4              visit         1


 telling me about ...
   Possible Next Word Frequency
1                 her         6
2                 how         4
3                 one         3
4                  my         2
5                 she         2
6                some         2
7                them         2
8                your         2
9                 all         1
10              being         1


 me about ...
   Possible Next Word Frequency
1                 how        56
2                  my        54
3                 her        38
4                your        36
5                what        34
6                when        28
7             because        22
8                  an        19
9               their        19
10                but        18


 about ...
   Possible Next Word Frequency
1                 how      2709
2                  my      2284
3                what      2196
4                 her      1131
5               their       980
6                them       882
7                your       876
8                 our       820
9                 all       807
10                 me       725

3. “I’d give anything to see arctic monkeys this…”



 would give anything see arctic monkeys ...


 give anything see arctic monkeys ...


 anything see arctic monkeys ...


 see arctic monkeys ...


 arctic monkeys ...
  Possible Next Word Frequency
1               bass         1
2              blink         1
3            classic         1
4          hopefully         1
5             humbug         1
6        immediately         1
7             really         1


 monkeys ...
   Possible Next Word Frequency
1                  or         5
2                were         4
3                 who         4
4                  an         3
5                 but         3
6                 one         3
7              really         3
8               their         3
9                when         3
10                 am         2

4. “Talking to your mom has the same effect as a hug and helps reduce your…”



 talking your mom has same effect hug helps reduce your ...


 your mom has same effect hug helps reduce your ...


 mom has same effect hug helps reduce your ...


 has same effect hug helps reduce your ...


 same effect hug helps reduce your ...


 effect hug helps reduce your ...


 hug helps reduce your ...


 helps reduce your ...


 reduce your ...
   Possible Next Word Frequency
1                risk        11
2                debt         3
3         electricity         2
4               needs         2
5                time         2
6             anxiety         1
7              budget         1
8              chance         1
9             chances         1
10              costs         1


 your ...
   Possible Next Word Frequency
1                 own      2504
2                life      1485
3            favorite       856
4               heart       662
5                mind       538
6                blog       536
7                eyes       519
8                head       491
9               child       490
10               body       467

5. “When you were in Holland you were like 1 inch away from me but you hadn’t time to take a…”



 when were holland were like 1 inch away me but hadnt time take ...


 were holland were like 1 inch away me but hadnt time take ...


 holland were like 1 inch away me but hadnt time take ...


 were like 1 inch away me but hadnt time take ...


 like 1 inch away me but hadnt time take ...


 1 inch away me but hadnt time take ...


 inch away me but hadnt time take ...


 away me but hadnt time take ...


 me but hadnt time take ...


 but hadnt time take ...


 hadnt time take ...


 time take ...
   Possible Next Word Frequency
1                care        18
2                look        12
3               break         7
4                  me         7
5              action         6
6                some         6
7                  up         6
8                your         6
9                 get         5
10               back         4


 take ...
   Possible Next Word Frequency
1                care      1287
2               place      1281
3                  me      1028
4                  my       931
5                time       842
6                look       734
7                over       716
8                  up       711
9                 out       698
10               some       672

6. “I’d just like all of these questions answered, a presentation of evidence, and a jury to settle the…”



 would just like all these questions answered presentation evidence jury settle ...


 just like all these questions answered presentation evidence jury settle ...


 like all these questions answered presentation evidence jury settle ...


 all these questions answered presentation evidence jury settle ...


 these questions answered presentation evidence jury settle ...


 questions answered presentation evidence jury settle ...


 answered presentation evidence jury settle ...


 presentation evidence jury settle ...


 evidence jury settle ...


 jury settle ...


 settle ...
   Possible Next Word Frequency
1                down       282
2                into       192
3                back        42
4                  my        35
5                  by        33
6                 her        23
7                some        19
8                 our        18
9                 but        17
10                one        16

7. “I can’t deal with unsymetrical things. I can’t even hold an uneven number of bags of groceries in each…”



 can not deal unsymetrical things can not even hold an uneven number bags groceries each ...


 not deal unsymetrical things can not even hold an uneven number bags groceries each ...


 deal unsymetrical things can not even hold an uneven number bags groceries each ...


 unsymetrical things can not even hold an uneven number bags groceries each ...


 things can not even hold an uneven number bags groceries each ...


 can not even hold an uneven number bags groceries each ...


 not even hold an uneven number bags groceries each ...


 even hold an uneven number bags groceries each ...


 hold an uneven number bags groceries each ...


 an uneven number bags groceries each ...


 uneven number bags groceries each ...


 number bags groceries each ...


 bags groceries each ...


 groceries each ...


 each ...
   Possible Next Word Frequency
1               other      4242
2                 day      1037
3                 one       706
4                time       632
5                 out       622
6               every       575
7                year       553
8                  us       541
9                them       466
10                 my       443

8. “Every inch of you is perfect from the bottom to the…”



 every inch perfect bottom ...


 inch perfect bottom ...


 perfect bottom ...


 bottom ...
   Possible Next Word Frequency
1                line       331
2                  my       125
3                post        98
4                 pan        73
5                page        56
6               right        56
7                left        53
8                your        46
9                  up        44
10               each        40

9. “I’m thankful my childhood was filled with imagination and bruises from playing…”



 am thankful my childhood filled imagination bruises playing ...


 thankful my childhood filled imagination bruises playing ...


 my childhood filled imagination bruises playing ...


 childhood filled imagination bruises playing ...


 filled imagination bruises playing ...


 imagination bruises playing ...


 bruises playing ...


 playing ...
   Possible Next Word Frequency
1              around       137
2                  my       125
3                game       103
4               along       102
5               games        94
6               field        88
7                 her        82
8               their        66
9               music        61
10               some        60

10. “I like how the same people are in almost all of Adam Sandler’s…”



 like how same people amst all adam sandlers ...


 how same people amst all adam sandlers ...


 same people amst all adam sandlers ...


 people amst all adam sandlers ...


 amst all adam sandlers ...


 all adam sandlers ...


 adam sandlers ...


 sandlers ...

[1] "NO MATCHES"

III. Summary

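Although this version runs faster, the discrepancy between these answers and those of Sentence Fragments and NLP II suggests that removing stopwords is a terrible idea.

A good example is question 5: with “a” removed, the fragment ends in “take” and the top prediction is “care” (“take care”), whereas the original ending “take a” suggests “look” (“take a look”).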

Stopwords are extremely useful in common phrases.
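
The “take care” versus “take a look” effect can be reproduced on a toy example. The sketch below is illustrative only: `toy_lines` and `following()` are hypothetical, and the three lines are made-up data rather than corpus text.

```r
toy_lines <- c("take a look at this", "take a look around", "take care now")

# Count the words that follow `context` (a vector of words) in each line,
# optionally dropping the words in `drop` from the lines first.
following <- function(context, lines, drop = character(0)) {
  out <- character(0)
  k <- length(context)
  for (line in lines) {
    w <- strsplit(line, " ")[[1]]
    w <- w[!(w %in% drop)]
    if (length(w) <= k) next
    for (i in seq_len(length(w) - k)) {
      if (identical(w[i:(i + k - 1)], context)) out <- c(out, w[i + k])
    }
  }
  table(out)
}

# Stopword kept: the context "take a" only ever leads to "look".
with_stopwords <- following(c("take", "a"), toy_lines)

# Stopword removed: the context shrinks to "take" and "care" now competes.
without_stopwords <- following("take", toy_lines, drop = "a")

print(with_stopwords)
print(without_stopwords)
```

With the stopword kept, the context is specific enough to single out one continuation; once it is removed, unrelated phrases that share the remaining word are pooled together.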

Appendix

Setup

knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)
options(java.parameters = "-Xmx8g")

Packages

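library(dplyr)
#qdap is assumed here: rm_stopwords() and Top25Words are used below, but no other loaded package provides them.
library(qdap)
library(readr)
library(RWeka)
library(stringr)
library(textclean)
library(tm)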

Read Files

#Files are read. Non-ASCII characters are replaced with <xx> byte escapes.
blogs <- read_lines(file = "en_US.blogs.txt")
blogs <- iconv(blogs, "UTF-8", "ASCII", sub = "byte")

news <- read_lines(file = "en_US.news.txt")
news <- iconv(news, "UTF-8", "ASCII", sub = "byte")

tweets <- read_lines(file = "en_US.twitter.txt")
tweets <- iconv(tweets, "UTF-8", "ASCII", sub = "byte")
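
With `sub = "byte"`, `iconv()` does not drop non-ASCII characters; it writes each unconvertible byte as a hex escape such as `<e2>`. A small sketch on a single made-up string (`sample_line` is hypothetical) shows the difference from `sub = ""`, which removes those bytes instead:

```r
# A string containing a curly apostrophe (U+2019), written with an escape.
sample_line <- "it\u2019s here"

# Each byte of the apostrophe becomes a visible <xx> escape.
as_bytes <- iconv(sample_line, "UTF-8", "ASCII", sub = "byte")

# An empty substitution drops the unconvertible bytes entirely.
dropped <- iconv(sample_line, "UTF-8", "ASCII", sub = "")

print(as_bytes)
print(dropped)
```

Which option is preferable depends on whether curly apostrophes in the corpus should still be recognisable to later cleanup steps.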

Stopwords

Top25Words

Helper Functions

#Text Cleanup Function:
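clean_up <- function(data) {
  
  #"’" is replaced with "'", contractions are expanded.
  temp <- replace_contraction(gsub("’", "'", data),
                              contraction.key = lexicon::key_contractions)
  #fixed = TRUE matches the dots literally. As a regex, "a.m." also matches
  #inside words (this is what turned "almost" into "amst" in the question 10 output).
  #"U.S." is replaced with "US".
  temp <- gsub("U.S.", "US", temp, fixed = TRUE)
  #"a.m." is replaced with "am".
  temp <- gsub("a.m.", "am", temp, fixed = TRUE)
  #"p.m." is replaced with "pm".
  temp <- gsub("p.m.", "pm", temp, fixed = TRUE)
  #Remove punctuation.
  temp <- removePunctuation(temp)
  #Remove stopwords; the result is a list with one word vector per input string.
  temp <- rm_stopwords(temp)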

  return(temp)
}
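
In a regular expression an unescaped "." matches any character, so a pattern such as "a.m." can match inside ordinary words. A base-R sketch of the difference between regex and literal matching, using a made-up `sample_text`:

```r
sample_text <- "almost 9 a.m. in the U.S."

# Regex matching: "." is a wildcard, so "almo" also matches "a.m.".
regex_result <- gsub("a.m.", "am", sample_text)

# Literal matching: only the exact characters "a.m." and "U.S." are replaced.
fixed_result <- gsub("a.m.", "am", sample_text, fixed = TRUE)
fixed_result <- gsub("U.S.", "US", fixed_result, fixed = TRUE)

print(regex_result)
print(fixed_result)
```

Escaping the dots (`"a\\.m\\."`) would be an equivalent alternative to `fixed = TRUE`.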



#Function returns the word immediately after a string for all lines that contain the string.
retrieve_next_words <- function(input_string_split) {

  #Word length of input string.
  n <- length(input_string_split)
  
  #Recombine into string
  input_string <- paste(input_string_split, collapse = " ")
  
  #Print string and whitespace.
  cat(paste('\n\n', input_string, '...\n'))

  #Final list.
  next_words <- NULL
  
  for (bnt in bnts) {
    #Check if the string is in the line (literal match, not a regex).
    if(grepl(input_string, bnt, fixed = TRUE)) {
      #Find where in the line it is located.
      beginning <- str_locate(bnt, fixed(input_string))[1]
      #Start from where the string is, split by space, grab the next word.
      next_word <- strsplit(substring(bnt, beginning), " ")[[1]][n+1]
      #Get rid of non-alphanumeric.
      next_word <- str_replace_all(next_word, regex("\\W+"), "")
      #Update final list.
      next_words <- c(next_words, next_word)
    }
  }
  return(next_words)
}
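
The locate, split and index idea used for each matching line can be illustrated on single strings with base R only. This is a sketch with made-up inputs (`line`, `end_line` and `phrase` are hypothetical), and it uses `regexpr()` in place of `str_locate()`:

```r
line <- "please tell me about your day"
phrase <- "me about"
n <- length(strsplit(phrase, " ", fixed = TRUE)[[1]])

# Position of the first literal occurrence of the phrase (-1 if absent).
start <- regexpr(phrase, line, fixed = TRUE)[1]

# Split the remainder on spaces; word n + 1 is the one after the phrase.
tail_words <- strsplit(substring(line, start), " ", fixed = TRUE)[[1]]
next_word <- tail_words[n + 1]

# When the phrase ends the line there is no following word, so the result is NA.
end_line <- "she told me about"
end_start <- regexpr(phrase, end_line, fixed = TRUE)[1]
end_next <- strsplit(substring(end_line, end_start), " ", fixed = TRUE)[[1]][n + 1]

print(next_word)
print(end_next)
```

`NA` entries like `end_next` do not affect the counts, because `table()` ignores `NA` by default. Note that this kind of substring search also matches inside longer words, for example "take" inside "mistake".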



possibilities <- function(q_string) {
  
  #Cleanup.
  q_string_split <- clean_up(q_string)[[1]]
  q_string <- paste(q_string_split, collapse = " ")
  #Word length of input string.
  n <- length(q_string_split)
  
  tables_displayed <- 0
  while (n > 0) {

    #Get next words from blogs, news, and tweets.
    q_words <- retrieve_next_words(q_string_split)
    
    #If there is at least one word...
    if (length(q_words) > 0) {
      
      #Update table count.
      tables_displayed <- tables_displayed + 1
      
      #Return at most 10 entries of the frequency table.

      df_freq <- df_pred_creator(q_words)
      
      if (dim(df_freq)[1] > 10){
        print.data.frame(df_freq[1:10,])
      } else {
        print.data.frame(df_freq)
      }
    }
    
    #Whether it worked or not, try again with a shortened string.
    #Shorten string by removing the leftmost word.
    q_string_split <- q_string_split[-1]
    q_string <- paste(q_string_split, collapse = " ")
    #Update word count; the loop ends when no words remain.
    n <- n - 1
  }
  
  #All string lengths failed to match.
  if (tables_displayed == 0) {
    return("NO MATCHES")
  }
      
}



#This function will create a Frequency dataframe from next_words.
df_pred_creator <- function(next_words) {

  #Convert sorted table to dataframe.
  df <- as.data.frame(sort(table(next_words), decreasing = TRUE))
  
  #If there is only one column, change to two columns from index and single column.
  if (length(df) == 1) {
    
    df <- cbind(newColName = rownames(df), df)
    rownames(df) <- 1:nrow(df)
  }
  
  #Change Column Names.
  df <- df %>%
    `colnames<-`(c("Possible Next Word", "Frequency"))
  
  return(df)
}
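
As a point of comparison, a frequency table with the same column names can be built from the names and values of the sorted table directly, which avoids the one-column special case. This is a sketch on made-up words (`sample_words` is hypothetical), not a replacement for the function above:

```r
sample_words <- c("how", "my", "how", "her", "my", "how")

# Count each word and sort from most to least frequent.
counts <- sort(table(sample_words), decreasing = TRUE)

# Build the two columns explicitly from the table's names and values.
freq_df <- data.frame(word = names(counts),
                      frequency = as.vector(counts),
                      stringsAsFactors = FALSE)
colnames(freq_df) <- c("Possible Next Word", "Frequency")

# Keep at most the top ten rows.
top_rows <- head(freq_df, 10)
print(top_rows)
```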

print("SETUP COMPLETE")

II. Predictions

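#replacement() is not defined in this appendix; it is assumed to apply the same cleanup as clean_up() to every line.
blogs <- replacement(blogs)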

news <- replacement(news)

tweets <- replacement(tweets)

bnts <- c(blogs, news, tweets)

print("REPLACEMENT COMPLETE")

1.

q1_string <- "When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd"

possibilities(q1_string)

2.

q2_string <- "Guy at my table's wife got up to go to the bathroom and I asked about dessert and he started telling me about his"

possibilities(q2_string)

3.

q3_string <- "I'd give anything to see arctic monkeys this"

possibilities(q3_string)

4.

q4_string <- "Talking to your mom has the same effect as a hug and helps reduce your"

possibilities(q4_string)

5.

q5_string <- "When you were in Holland you were like 1 inch away from me but you hadn't time to take a"

possibilities(q5_string)

6.

q6_string <- "I'd just like all of these questions answered, a presentation of evidence, and a jury to settle the"

possibilities(q6_string)

7.

q7_string <- "I can't deal with unsymetrical things. I can't even hold an uneven number of bags of groceries in each"

possibilities(q7_string)

8.

q8_string <- "Every inch of you is perfect from the bottom to the"

possibilities(q8_string)

9.

q9_string <- "I’m thankful my childhood was filled with imagination and bruises from playing"

possibilities(q9_string)

10.

q10_string <- "I like how the same people are in almost all of Adam Sandler's"

possibilities(q10_string)

Data Science Capstone

Sentence Fragments and NLP III

Rohan Lewis

2020.12.16
