Continuing from the previous document, I am now exploring nGrams of the three files.
Due to size limitations, only the first 25% of each dataset is being used.
This data will eventually be used for predicting the next word in a statement.
See the Appendix for code.
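Contractions have been expanded (for example, "don't" becomes "do not"), which is why phrases such as "I do not" appear in the tables.
Each table lists the ten most frequent nGrams. The tables run from 3-grams to 6-grams for each of the three files in turn, followed by the same four tables for the three files combined.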
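As a rough illustration of the 25% subset mentioned above, the sketch below keeps the first quarter of a character vector of lines. The helper name `take_first_fraction` is hypothetical, and cutting by line count is an assumption; the report does not say how the 25% was measured.

```r
# Hypothetical helper (not from the original report): keep the first
# `fraction` of a character vector, measured by number of lines.
take_first_fraction <- function(lines, fraction = 0.25) {
  n_keep <- floor(length(lines) * fraction)
  head(lines, n_keep)
}

# Toy input standing in for the result of readLines() on one of the files.
toy_lines <- paste("line", 1:8)
subset_lines <- take_first_fraction(toy_lines)
print(subset_lines)
```

With real data, `toy_lines` would be replaced by the lines read from one of the three files.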
nGram Frequency
1 I do not 23824
2 I am not 12932
3 one of the 11892
4 I have been 11664
5 it is a 11400
6 I did not 11364
7 a lot of 11348
8 I can not 9048
9 it is not 8964
10 do not know 7600
nGram Frequency
1 I do not know 4816
2 I am going to 4512
3 the end of the 3152
4 do not want to 2908
5 the rest of the 2732
6 I do not think 2680
7 I would like to 2528
8 at the end of 2412
9 I am not sure 2348
10 I do not have 2300
nGram Frequency
1 I do not want to 1476
2 at the end of the 1232
3 interests Vested interests Vested interests 1000
4 Vested interests Vested interests Vested 1000
5 I can not wait to 824
6 I do not know what 804
7 I am not going to 736
8 in the middle of the 736
9 I do not know if 720
10 I do not think I 720
nGram Frequency
1 Vested interests Vested interests Vested interests 1000
2 interests Vested interests Vested interests Vested 996
3 at the end of the day 264
4 I can not wait to see 264
5 I do not know if I 224
6 on the other side of the 224
7 background none repeat scroll 0% 0% 220
8 none repeat scroll 0% 0% yellow 220
9 style= background none repeat scroll 0% 220
10 I do not want to be 212
nGram Frequency
1 one of the 12316
2 a lot of 10164
3 it is a 9772
4 I do not 8912
5 it is not 6748
6 as well as 6108
7 is going to 5696
8 are going to 5448
9 the end of 5388
10 out of the 5348
nGram Frequency
1 the end of the 2676
2 is going to be 2456
3 for the first time 2408
4 at the end of 2208
5 I do not know 2072
6 is one of the 2048
7 I do not think 2028
8 we are going to 2028
9 the rest of the 1996
10 do not want to 1888
nGram Frequency
1 at the end of the 1044
2 it is going to be 988
3 is going to be a 752
4 for the first time in 628
5 I do not want to 584
6 in the middle of the 584
7 for the first time since 560
8 there is a lot of 500
9 the end of the day 472
10 I do not know if 464
nGram Frequency
1 it is going to be a 328
2 could not be reached for comment 296
3 at the end of the day 232
4 I do not think it is 200
5 Centers for Disease Control and Prevention 188
6 on the New York Stock Exchange 180
7 we are going to have to 180
8 Rock and Roll Hall of Fame 168
9 by the end of the year 164
10 in the first round of the 160
nGram Frequency
1 I do not 23412
2 Thanks for the 15432
3 can not wait 14084
4 I can not 13320
5 I am not 11464
6 I will be 9184
7 it is a 8520
8 thanks for the 7760
9 for the follow 7556
10 not wait to 7256
nGram Frequency
1 can not wait to 7168
2 I am going to 5488
3 Thanks for the follow 3928
4 is going to be 3700
5 I do not know 3648
6 can not wait for 2940
7 I can not wait 2888
8 I do not think 2584
9 not wait to see 2568
10 Thanks for the RT 2500
nGram Frequency
1 can not wait to see 2560
2 I can not wait to 1328
3 it is going to be 1240
4 is going to be a 1108
5 I do not want to 1100
6 can not wait for the 748
7 I do not know what 676
8 not wait to see you 668
9 I do not think I 604
10 at the end of the 600
nGram Frequency
1 can not wait to see you 664
2 it is going to be a 524
3 I can not wait to see 412
4 Happy Mother s Day to all 368
5 can not wait to see the 264
6 can not wait to see what 256
7 I am going to have to 240
8 I think I am going to 240
9 I am not the only one 232
10 Mother s Day to all the 216
nGram Frequency
1 I do not 56148
2 it is a 29692
3 one of the 28724
4 I am not 27968
5 a lot of 26944
6 I can not 24340
7 it is not 21764
8 I did not 20824
9 I have been 20172
10 do not know 18512
nGram Frequency
1 I am going to 11296
2 I do not know 10536
3 can not wait to 8632
4 is going to be 7948
5 I do not think 7292
6 do not want to 7268
7 the end of the 7248
8 the rest of the 6108
9 at the end of 5692
10 for the first time 5600
nGram Frequency
1 I do not want to 3160
2 can not wait to see 3020
3 at the end of the 2876
4 it is going to be 2740
5 is going to be a 2264
6 I can not wait to 2220
7 I do not know what 1832
8 in the middle of the 1660
9 I do not know if 1588
10 I am not going to 1516
nGram Frequency
1 it is going to be a 1028
2 Vested interests Vested interests Vested interests 1000
3 interests Vested interests Vested interests Vested 996
4 can not wait to see you 688
5 I can not wait to see 688
6 at the end of the day 684
7 I do not know if I 488
8 I am going to have to 484
9 I do not think it is 476
10 can not wait to see what 460
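# Packages the functions below appear to rely on.
library(textclean)  # replace_contraction()
library(RWeka)      # NGramTokenizer(), Weka_control()
library(dplyr)      # %>%

# Function:
# "’" is replaced with "'", contractions are expanded.
# "U.S." is replaced with "US".
# "a.m." is replaced with "am".
# "p.m." is replaced with "pm".
# fixed = TRUE makes gsub() match the periods literally; read as a regular
# expression, "." would match any character (so "a.m." would also hit "admi").
replacement <- function(data) {
  temp <- replace_contraction(gsub("’", "'", data),
                              contraction.key = lexicon::key_contractions)
  temp <- gsub("U.S.", "US", temp, fixed = TRUE)
  temp <- gsub("a.m.", "am", temp, fixed = TRUE)
  temp <- gsub("p.m.", "pm", temp, fixed = TRUE)
  return(temp)
}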
# This function will list ngrams for a specified n and file.
n_gram_creator <- function(data, n) {
  n_grams <- NGramTokenizer(data, Weka_control(min = n, max = n))
  return(n_grams)
}
# This function will create a Frequency dataframe of n_grams,
# sorted from most to least frequent.
df_freq_creator <- function(n_grams) {
  df <- data.frame(sort(table(n_grams), decreasing = TRUE)) %>%
    `colnames<-`(c("nGram", "Frequency"))
  return(df)
}
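
Since the stated goal is next-word prediction, here is a minimal sketch of how a frequency table with the columns nGram and Frequency might be used for a lookup. The function `predict_next_word` is hypothetical and not part of the original code. It simply returns the word that follows the given prefix in the most frequent matching nGram, with no smoothing or back-off. The sample rows are copied from the last 4-gram table above.

```r
# Hypothetical helper (not from the original report): find the most frequent
# n-gram that starts with `prefix` and return the word right after the prefix.
predict_next_word <- function(df, prefix) {
  grams <- as.character(df$nGram)
  # Require a trailing space so that "I do" does not match "I don".
  hits <- startsWith(grams, paste0(prefix, " "))
  if (!any(hits)) {
    return(NA_character_)
  }
  hit_grams <- grams[hits]
  hit_freqs <- df$Frequency[hits]
  # Highest-frequency match wins; ties go to the first one listed.
  best <- hit_grams[which.max(hit_freqs)]
  words <- strsplit(best, " ", fixed = TRUE)[[1]]
  n_prefix <- length(strsplit(prefix, " ", fixed = TRUE)[[1]])
  words[n_prefix + 1]
}

# A few rows copied from the last 4-gram table in this report.
four_grams <- data.frame(
  nGram = c("I am going to", "I do not know", "I do not think", "do not want to"),
  Frequency = c(11296, 10536, 7292, 7268),
  stringsAsFactors = FALSE
)

next_word <- predict_next_word(four_grams, "I do not")
print(next_word)
```

A fuller predictor would likely fall back to shorter nGrams when the prefix has no match; this sketch just returns NA in that case.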