I. Background

Continuing from the previous document, I now explore the n-grams of the three files.

Due to size limitations, only the first 25% of each dataset is used.

These data will eventually be used to predict the next word in a statement.

See the Appendix for code.
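
As a rough illustration of how frequency tables like the ones in this report could feed a next-word prediction, here is a minimal sketch. It assumes a table with the same two columns (nGram, Frequency) and a prefix that is one word shorter than the stored n-grams. The toy data and the helper predict_next() are hypothetical and are not part of the report's code.

```r
# Hypothetical toy frequency table shaped like the tables in this report.
freq <- data.frame(
  nGram = c("I do not", "I do it", "one of the", "one of my"),
  Frequency = c(5, 2, 4, 1),
  stringsAsFactors = FALSE
)

# predict_next() is an illustrative helper (an assumption, not the author's method).
# It assumes the prefix has one word fewer than the n-grams in the table.
predict_next <- function(freq, prefix) {
  # Keep the n-grams that begin with the prefix followed by a space.
  hits <- freq[startsWith(freq$nGram, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  # Take the most frequent match and return its final word.
  best <- hits$nGram[which.max(hits$Frequency)]
  sub("^.* ", "", best)
}

predict_next(freq, "I do")     # "not"
predict_next(freq, "one of")   # "the"
predict_next(freq, "so far")   # NA, because no stored n-gram starts with it
```

A fuller model would usually fall back to shorter n-grams when the longer prefix has no match. That step is left out here.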



II. n-Gram Distributions

Contractions were expanded before tokenizing, so "don't" is counted as "do not".

A. Blogs

1. Trigrams

         nGram Frequency
1     I do not     23824
2     I am not     12932
3   one of the     11892
4  I have been     11664
5      it is a     11400
6    I did not     11364
7     a lot of     11348
8    I can not      9048
9    it is not      8964
10 do not know      7600

2. 4-grams

             nGram Frequency
1    I do not know      4816
2    I am going to      4512
3   the end of the      3152
4   do not want to      2908
5  the rest of the      2732
6   I do not think      2680
7  I would like to      2528
8    at the end of      2412
9    I am not sure      2348
10   I do not have      2300

3. 5-grams

                                         nGram Frequency
1                             I do not want to      1476
2                            at the end of the      1232
3  interests Vested interests Vested interests      1000
4     Vested interests Vested interests Vested      1000
5                            I can not wait to       824
6                           I do not know what       804
7                            I am not going to       736
8                         in the middle of the       736
9                             I do not know if       720
10                            I do not think I       720

4. 6-grams

                                                nGram Frequency
1  Vested interests Vested interests Vested interests      1000
2  interests Vested interests Vested interests Vested       996
3                               at the end of the day       264
4                               I can not wait to see       264
5                                  I do not know if I       224
6                            on the other side of the       224
7                 background none repeat scroll 0% 0%       220
8                     none repeat scroll 0% 0% yellow       220
9             style= background none repeat scroll 0%       220
10                                I do not want to be       212

B. News

1. Trigrams

          nGram Frequency
1    one of the     12316
2      a lot of     10164
3       it is a      9772
4      I do not      8912
5     it is not      6748
6    as well as      6108
7   is going to      5696
8  are going to      5448
9    the end of      5388
10   out of the      5348

2. 4-grams

                nGram Frequency
1      the end of the      2676
2      is going to be      2456
3  for the first time      2408
4       at the end of      2208
5       I do not know      2072
6       is one of the      2048
7      I do not think      2028
8     we are going to      2028
9     the rest of the      1996
10     do not want to      1888

3. 5-grams

                      nGram Frequency
1         at the end of the      1044
2         it is going to be       988
3          is going to be a       752
4     for the first time in       628
5          I do not want to       584
6      in the middle of the       584
7  for the first time since       560
8         there is a lot of       500
9        the end of the day       472
10         I do not know if       464

4. 6-grams

                                        nGram Frequency
1                         it is going to be a       328
2            could not be reached for comment       296
3                       at the end of the day       232
4                        I do not think it is       200
5  Centers for Disease Control and Prevention       188
6              on the New York Stock Exchange       180
7                     we are going to have to       180
8                  Rock and Roll Hall of Fame       168
9                      by the end of the year       164
10                  in the first round of the       160

C. Tweets

1. Trigrams

            nGram Frequency
1        I do not     23412
2  Thanks for the     15432
3    can not wait     14084
4       I can not     13320
5        I am not     11464
6       I will be      9184
7         it is a      8520
8  thanks for the      7760
9  for the follow      7556
10    not wait to      7256

2. 4-grams

                   nGram Frequency
1        can not wait to      7168
2          I am going to      5488
3  Thanks for the follow      3928
4         is going to be      3700
5          I do not know      3648
6       can not wait for      2940
7         I can not wait      2888
8         I do not think      2584
9        not wait to see      2568
10     Thanks for the RT      2500

3. 5-grams

                  nGram Frequency
1   can not wait to see      2560
2     I can not wait to      1328
3     it is going to be      1240
4      is going to be a      1108
5      I do not want to      1100
6  can not wait for the       748
7    I do not know what       676
8   not wait to see you       668
9      I do not think I       604
10    at the end of the       600

4. 6-grams

                       nGram Frequency
1    can not wait to see you       664
2        it is going to be a       524
3      I can not wait to see       412
4  Happy Mother s Day to all       368
5    can not wait to see the       264
6   can not wait to see what       256
7      I am going to have to       240
8      I think I am going to       240
9      I am not the only one       232
10   Mother s Day to all the       216

D. All

1. Trigrams

         nGram Frequency
1     I do not     56148
2      it is a     29692
3   one of the     28724
4     I am not     27968
5     a lot of     26944
6    I can not     24340
7    it is not     21764
8    I did not     20824
9  I have been     20172
10 do not know     18512

2. 4-grams

                nGram Frequency
1       I am going to     11296
2       I do not know     10536
3     can not wait to      8632
4      is going to be      7948
5      I do not think      7292
6      do not want to      7268
7      the end of the      7248
8     the rest of the      6108
9       at the end of      5692
10 for the first time      5600

3. 5-grams

                  nGram Frequency
1      I do not want to      3160
2   can not wait to see      3020
3     at the end of the      2876
4     it is going to be      2740
5      is going to be a      2264
6     I can not wait to      2220
7    I do not know what      1832
8  in the middle of the      1660
9      I do not know if      1588
10    I am not going to      1516

4. 6-grams

                                                nGram Frequency
1                                 it is going to be a      1028
2  Vested interests Vested interests Vested interests      1000
3  interests Vested interests Vested interests Vested       996
4                             can not wait to see you       688
5                               I can not wait to see       688
6                               at the end of the day       684
7                                  I do not know if I       488
8                               I am going to have to       484
9                                I do not think it is       476
10                           can not wait to see what       460

Appendix


Setup

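#Output is printed without the "##" prefix and scientific notation is avoided.
#The Java heap is raised to 8 GB; this is set before RWeka is loaded so the JVM picks it up.
knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)
options(java.parameters = "-Xmx8g")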

Packages

library(dplyr)
library(readr)
library(RWeka)
library(textclean)
library(tm)

Read Files

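#Files are read, and the first 25% of the lines in each file is kept.
#seq_len(length(x) %/% 4) is used as the index because 0:length(x) / 4 parses
#as (0:length(x)) / 4, whose fractional values truncate and select each kept
#line four times (the counts in Section II, all multiples of 4, reflect that).

blogs <- read_lines(file = "en_US.blogs.txt")
blogs <- blogs[seq_len(length(blogs) %/% 4)]

news <- read_lines(file = "en_US.news.txt")
news <- news[seq_len(length(news) %/% 4)]

tweets <- read_lines(file = "en_US.twitter.txt")
tweets <- tweets[seq_len(length(tweets) %/% 4)]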
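
Operator precedence matters when taking the first quarter of a vector by index, because `:` binds more tightly than `/`. The following check uses a hypothetical eight-element toy vector x (not one of the datasets) to compare the two index expressions:

```r
# Toy vector standing in for the lines of a file (an assumption for illustration).
x <- letters[1:8]

# ":" is evaluated before "/", so this is (0:8) / 4, not 0:(8 / 4).
idx <- 0:length(x) / 4
idx
# 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00

# Fractional indices are truncated toward zero and zeros are dropped,
# so the first element is selected four times.
x[idx]
# "a" "a" "a" "a" "b"

# Integer division gives the intended first quarter, each element once.
x[seq_len(length(x) %/% 4)]
# "a" "b"
```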

Helper Functions

#Function:
#"’" is replaced with "'", contractions are expanded.
#"U.S." is replaced with "US".
#"a.m." is replaced with "am".
#"p.m." is replaced with "pm".
#fixed = TRUE makes gsub() match each "." literally rather than as a regex wildcard.
replacement <- function(data) {
  
  temp <- replace_contraction(gsub("’", "'", data),
                              contraction.key = lexicon::key_contractions)
  temp <- gsub("U.S.", "US", temp, fixed = TRUE)
  temp <- gsub("a.m.", "am", temp, fixed = TRUE)
  temp <- gsub("p.m.", "pm", temp, fixed = TRUE)
  return(temp)
}
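
In a regular expression, "." matches any single character, so a pattern such as "a.m." can also match inside ordinary words unless gsub() is called with fixed = TRUE. A short base R check on made-up strings shows the difference:

```r
# As a regex, "a.m." means: "a", any character, "m", any character.
gsub("a.m.", "am", "calmly")                  # "camy"  ("alml" was matched)
gsub("U.S.", "US", "UNSC")                    # "US"    ("UNSC" was matched)

# With fixed = TRUE the pattern is treated as a literal string.
gsub("a.m.", "am", "calmly", fixed = TRUE)    # "calmly" (unchanged)
gsub("a.m.", "am", "at 9 a.m.", fixed = TRUE) # "at 9 am"
gsub("U.S.", "US", "the U.S. economy", fixed = TRUE) # "the US economy"
```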

#This function will list ngrams for a specified n and file.
#min = n and max = n restrict the tokenizer to n-grams of exactly n words.
n_gram_creator <- function(data, n) {
  
  n_grams <- NGramTokenizer(data, Weka_control(min = n, max = n))
  
  return(n_grams)
}
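
To make concrete what an n-gram tokenizer returns, here is a base R sketch that needs neither Java nor RWeka. It is a simplified stand-in and not RWeka's implementation. The helper name simple_ngrams() is hypothetical. Splitting on whitespace and punctuation is an assumption, chosen to mimic what I understand Weka's default delimiters to do, which would explain entries such as "Mother s Day" in the tables.

```r
# simple_ngrams() is an illustrative helper, not part of RWeka.
simple_ngrams <- function(text, n) {
  one_line <- function(line) {
    # Split on runs of whitespace or punctuation (assumed delimiter set).
    tokens <- unlist(strsplit(line, "[[:space:][:punct:]]+"))
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    # Slide a window of n tokens across the line.
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }
  # n-grams are built per line, so they never span two lines.
  unlist(lapply(text, one_line))
}

simple_ngrams("Happy Mother's Day to all", 3)
# "Happy Mother s" "Mother s Day" "s Day to" "Day to all"
```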



#This function will create a Frequency dataframe of n_grams,
#sorted from most to least frequent.
df_freq_creator <- function(n_grams) {

  df <- data.frame(sort(table(n_grams), decreasing = TRUE)) %>%
    `colnames<-`(c("nGram", "Frequency"))
  
  return(df)
}
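
The counting step can be tried on a small made-up vector of n-grams, using base R only. The toy vector is an assumption for illustration. The table/sort/data frame sequence is the same idea as in the helper above, written without the pipe.

```r
# Hypothetical n-grams, as a tokenizer might return them.
toy <- c("I do not", "I am not", "I do not", "one of the", "I do not", "I am not")

# table() counts each distinct n-gram; sort() orders the counts, largest first.
freq <- as.data.frame(sort(table(toy), decreasing = TRUE),
                      stringsAsFactors = FALSE)
colnames(freq) <- c("nGram", "Frequency")

freq
#        nGram Frequency
# 1   I do not         3
# 2   I am not         2
# 3 one of the         1
```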

II. n-Gram Distributions

A. Blogs

blogs <- replacement(blogs)

1. Trigrams

blogs3 <- n_gram_creator(blogs, 3)

df_freq_creator(blogs3)[1:10,]

2. 4-grams

blogs4 <- n_gram_creator(blogs, 4)

df_freq_creator(blogs4)[1:10,]

3. 5-grams

blogs5 <- n_gram_creator(blogs, 5)

df_freq_creator(blogs5)[1:10,]

4. 6-grams

blogs6 <- n_gram_creator(blogs, 6)

df_freq_creator(blogs6)[1:10,]

B. News

news <- replacement(news)

1. Trigrams

news3 <- n_gram_creator(news, 3)

df_freq_creator(news3)[1:10,]

2. 4-grams

news4 <- n_gram_creator(news, 4)

df_freq_creator(news4)[1:10,]

3. 5-grams

news5 <- n_gram_creator(news, 5)

df_freq_creator(news5)[1:10,]

4. 6-grams

news6 <- n_gram_creator(news, 6)

df_freq_creator(news6)[1:10,]

C. Tweets

tweets <- replacement(tweets)

1. Trigrams

tweets3 <- n_gram_creator(tweets, 3)

df_freq_creator(tweets3)[1:10,]

2. 4-grams

tweets4 <- n_gram_creator(tweets, 4)

df_freq_creator(tweets4)[1:10,]

3. 5-grams

tweets5 <- n_gram_creator(tweets, 5)

df_freq_creator(tweets5)[1:10,]

4. 6-grams

tweets6 <- n_gram_creator(tweets, 6)

df_freq_creator(tweets6)[1:10,]

D. All

1. Trigrams

df_freq_creator(c(blogs3, news3, tweets3))[1:10,]

2. 4-grams

df_freq_creator(c(blogs4, news4, tweets4))[1:10,]

3. 5-grams

df_freq_creator(c(blogs5, news5, tweets5))[1:10,]

4. 6-grams

df_freq_creator(c(blogs6, news6, tweets6))[1:10,]
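
The combined tables are built by concatenating the raw n-gram vectors and counting again. The same counts can be reached by summing the per-source counts, which avoids holding every raw n-gram in memory at once. The sketch below checks that equivalence on made-up vectors. The toy data and the helper count_df() are hypothetical.

```r
# Hypothetical n-grams from two sources.
blogs_toy  <- c("I do not", "one of the", "I do not")
tweets_toy <- c("I do not", "can not wait")

# count_df() is an illustrative helper returning one row per distinct n-gram.
count_df <- function(x) {
  tab <- table(x)
  data.frame(nGram = names(tab), Frequency = as.integer(tab),
             stringsAsFactors = FALSE)
}

# Route 1: concatenate the raw vectors, then count.
combined <- table(c(blogs_toy, tweets_toy))

# Route 2: count each source, stack the tables, and sum by n-gram.
stacked <- rbind(count_df(blogs_toy), count_df(tweets_toy))
summed  <- tapply(stacked$Frequency, stacked$nGram, sum)

# Both routes give the same count for every n-gram.
all(summed[names(combined)] == as.integer(combined))   # TRUE
```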