The data collected by the Behavioral Risk Factor Surveillance System (BRFSS) appears to be a randomly selected sample of the US population. Between 2013 and 2014, the US population was approximately 316 to 318 million, so the 491,775 people interviewed make up approximately 0.15% of it. I have some reservations about the selection process, which was not completely specified.
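As a quick check of the sampling fraction quoted above, the arithmetic can be sketched in R. The variable names are illustrative, and the population bounds are the approximate figures given in the text, not exact census values.

```r
# Sample size and approximate US population bounds quoted in the text.
sample_size <- 491775
us_population <- c(low = 316e6, high = 318e6)

# Share of the population that was interviewed, in percent.
sample_share_pct <- 100 * sample_size / us_population
round(sample_share_pct, 3)
```

Both bounds give a share between 0.15% and 0.16%, so the rounded figure does not depend on which year's population is used.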
First, approximately 2.5% of US households may not have been represented, as “Overall, an estimated 97.5% of US households had telephone service in 2012.”
Second, the distribution of interviews among the states may not be representative. The vast majority of states had over 5,000 residents interviewed. If the sample were proportional to population, however, the states with the most interviews would be, in order, California, Texas, New York, and Florida, the most populous states in 2013. Instead, Florida is vastly overrepresented with 34,186 interviews, followed by Kansas (23,282), Nebraska (17,139), and Massachusetts (15,071).
This is an observational study: respondents were interviewed rather than randomly assigned to any treatment, and no hypotheses or controls were specified beforehand. We can use this data to generalize to the United States population, but not to make causal arguments.
Research question 1:
Interview frequency by month varies from 34,172 in January to 44,452 in March. It seems to me that people would have less time in the spring and summer months, as they are more likely to have kids at home and to have prior engagements. In the fall and winter months I would guess people have more time to complete the interview, though some would again spend more time with their kids or shopping for gifts.
Is there a relationship between the Final Disposition (whether or not the interview was completed), the Interview Month, and number of Children?
Research question 2:
I am genuinely curious as to how aware people are of their own health. Are they accurate and not in denial?
Is there a relationship between opinion of one's General Health and having been diagnosed with High Blood Cholesterol and a Heart Attack?
Research question 3:
Numerous studies link sleep with diabetes, as hormones play an important role during rest, influencing glucose regulation. I have sourced two such studies.
Impact of sleep and sleep loss on glucose homeostasis and appetite regulation
Role of sleep duration in the regulation of glucose metabolism and appetite
Is this relationship of Duration of Sleep and Diabetes consistent in the BRFSS dataset?
Research question 1:
Is there a relationship between the Final Disposition (whether or not the interview was completed), the Interview Month, and number of Children?
I first selected only the three columns I am observing, imonth, dispcode, and children. Here is the summary.
      imonth                              dispcode          children
 March   : 44476   Completed interview          :433222   Min.   : 0.0000
 July    : 43667   Partially completed interview: 58548   1st Qu.: 0.0000
 April   : 42936   NA's                         :     5   Median : 0.0000
 February: 42867                                          Mean   : 0.5167
 August  : 42301                                          3rd Qu.: 1.0000
 (Other) :275525                                          Max.   :47.0000
 NA's    :     3                                          NA's   :2274
Of the missing values, 3 are from imonth, 5 are from dispcode, and 2,274 are from children (Refused and [Missing]). I removed these entries. From the tally of children, we see that the vast majority of households interviewed have 0 to 3 children, so households with 4 or more children are grouped into one category.
Month Children Disposition Count
1 January 0 Completed interview 22529
2 January 0 Partially completed interview 2457
3 January 1 Completed interview 3290
4 January 1 Partially completed interview 471
5 January 2 Completed interview 2866
6 January 2 Partially completed interview 404
7 January 3 Completed interview 1145
8 January 3 Partially completed interview 178
9 January 4 or More Completed interview 617
10 January 4 or More Partially completed interview 103
11 February 0 Completed interview 27817
12 February 0 Partially completed interview 3271
13 February 1 Completed interview 4133
14 February 1 Partially completed interview 593
15 February 2 Completed interview 3596
16 February 2 Partially completed interview 585
17 February 3 Completed interview 1542
18 February 3 Partially completed interview 252
19 February 4 or More Completed interview 789
20 February 4 or More Partially completed interview 137
21 March 0 Completed interview 28662
22 March 0 Partially completed interview 3446
23 March 1 Completed interview 4382
24 March 1 Partially completed interview 709
25 March 2 Completed interview 3777
26 March 2 Partially completed interview 657
27 March 3 Completed interview 1458
28 March 3 Partially completed interview 274
29 March 4 or More Completed interview 746
30 March 4 or More Partially completed interview 145
31 April 0 Completed interview 27909
32 April 0 Partially completed interview 3317
33 April 1 Completed interview 4051
34 April 1 Partially completed interview 681
35 April 2 Completed interview 3524
36 April 2 Partially completed interview 640
37 April 3 Completed interview 1501
38 April 3 Partially completed interview 258
39 April 4 or More Completed interview 725
40 April 4 or More Partially completed interview 128
41 May 0 Completed interview 25909
42 May 0 Partially completed interview 3395
43 May 1 Completed interview 3828
44 May 1 Partially completed interview 667
45 May 2 Completed interview 3300
46 May 2 Partially completed interview 594
47 May 3 Completed interview 1394
48 May 3 Partially completed interview 271
49 May 4 or More Completed interview 704
50 May 4 or More Partially completed interview 136
51 June 0 Completed interview 25179
52 June 0 Partially completed interview 2838
53 June 1 Completed interview 3507
54 June 1 Partially completed interview 581
55 June 2 Completed interview 2911
56 June 2 Partially completed interview 503
57 June 3 Completed interview 1257
58 June 3 Partially completed interview 226
59 June 4 or More Completed interview 633
60 June 4 or More Partially completed interview 107
61 July 0 Completed interview 28506
62 July 0 Partially completed interview 3396
63 July 1 Completed interview 4054
64 July 1 Partially completed interview 622
65 July 2 Completed interview 3652
66 July 2 Partially completed interview 535
67 July 3 Completed interview 1511
68 July 3 Partially completed interview 247
69 July 4 or More Completed interview 789
70 July 4 or More Partially completed interview 130
71 August 0 Completed interview 27700
72 August 0 Partially completed interview 3207
73 August 1 Completed interview 3964
74 August 1 Partially completed interview 659
75 August 2 Completed interview 3419
76 August 2 Partially completed interview 582
77 August 3 Completed interview 1439
78 August 3 Partially completed interview 229
79 August 4 or More Completed interview 789
80 August 4 or More Partially completed interview 143
81 September 0 Completed interview 25223
82 September 0 Partially completed interview 3014
83 September 1 Completed interview 3527
84 September 1 Partially completed interview 545
85 September 2 Completed interview 3134
86 September 2 Partially completed interview 508
87 September 3 Completed interview 1329
88 September 3 Partially completed interview 231
89 September 4 or More Completed interview 742
90 September 4 or More Partially completed interview 120
91 October 0 Completed interview 27887
92 October 0 Partially completed interview 3271
93 October 1 Completed interview 3721
94 October 1 Partially completed interview 602
95 October 2 Completed interview 3351
96 October 2 Partially completed interview 578
97 October 3 Completed interview 1476
98 October 3 Partially completed interview 278
99 October 4 or More Completed interview 755
100 October 4 or More Partially completed interview 145
101 November 0 Completed interview 27160
102 November 0 Partially completed interview 3595
103 November 1 Completed interview 3727
104 November 1 Partially completed interview 676
105 November 2 Completed interview 3334
106 November 2 Partially completed interview 643
107 November 3 Completed interview 1416
108 November 3 Partially completed interview 318
109 November 4 or More Completed interview 676
110 November 4 or More Partially completed interview 153
111 December 0 Completed interview 26359
112 December 0 Partially completed interview 3431
113 December 1 Completed interview 3584
114 December 1 Partially completed interview 634
115 December 2 Completed interview 3025
116 December 2 Partially completed interview 564
117 December 3 Completed interview 1342
118 December 3 Partially completed interview 256
119 December 4 or More Completed interview 740
120 December 4 or More Partially completed interview 152
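The "4 or More" grouping used in the table above can be sketched on a small vector. The child counts below are hypothetical, not survey values, and `ifelse()` is just one way to do the recoding. It compares the numbers first and only then converts them to text.

```r
# Hypothetical child counts for eight households (not taken from the survey).
children <- c(0, 2, 5, 1, 47, 3, 4, 0)

# Keep 0-3 as they are and collapse everything above 3 into one label.
children_group <- ifelse(children > 3, "4 or More", as.character(children))

# Tally households per group.
table(children_group)
```

Comparing before converting avoids relying on how R orders text values such as "10" and "4".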
The plot below graphs the data from the summary above. This representation highlights that most of the households interviewed had no children, and that as the number of children increases, the number of households decreases. The number of interviews conducted seems to be fairly evenly spread across the months. The share of interviews that were not completed seems to be a consistently small percentage, regardless of children and month.
I was not satisfied with the above plot, so I decided to explore the data again. For each month and number of children, I calculated the Percent of interviews completed as Completed / (Completed + Partially Completed). The data is shown below.
Month Children Percent
1 January 0 0.9016649
2 January 1 0.8747673
3 January 2 0.8764526
4 January 3 0.8654573
5 January 4 or More 0.8569444
6 February 0 0.8947826
7 February 1 0.8745239
8 February 2 0.8600813
9 February 3 0.8595318
10 February 4 or More 0.8520518
11 March 0 0.8926747
12 March 1 0.8607346
13 March 2 0.8518268
14 March 3 0.8418014
15 March 4 or More 0.8372615
16 April 0 0.8937744
17 April 1 0.8560862
18 April 2 0.8463016
19 April 3 0.8533258
20 April 4 or More 0.8499414
21 May 0 0.8841455
22 May 1 0.8516129
23 May 2 0.8474576
24 May 3 0.8372372
25 May 4 or More 0.8380952
26 June 0 0.8987044
27 June 1 0.8578767
28 June 2 0.8526655
29 June 3 0.8476062
30 June 4 or More 0.8554054
31 July 0 0.8935490
32 July 1 0.8669803
33 July 2 0.8722235
34 July 3 0.8594994
35 July 4 or More 0.8585419
36 August 0 0.8962371
37 August 1 0.8574519
38 August 2 0.8545364
39 August 3 0.8627098
40 August 4 or More 0.8465665
41 September 0 0.8932606
42 September 1 0.8661591
43 September 2 0.8605162
44 September 3 0.8519231
45 September 4 or More 0.8607889
46 October 0 0.8950189
47 October 1 0.8607449
48 October 2 0.8528888
49 October 3 0.8415051
50 October 4 or More 0.8388889
51 November 0 0.8831084
52 November 1 0.8464683
53 November 2 0.8383203
54 November 3 0.8166090
55 November 4 or More 0.8154403
56 December 0 0.8848271
57 December 1 0.8496918
58 December 2 0.8428532
59 December 3 0.8397997
60 December 4 or More 0.8295964
The plot below graphs the data from the summary above. Some trends are now clearly visible. The households with 0 children have the highest percentage of completion, and it seems that as children increase, the percentage drops slightly.
November has a noticeably lower rate of completion than the other months, which could possibly be attributed to traveling for Thanksgiving and preparation for Christmas. December also has a low rate of completion, possibly because of Christmas and New Year. January and February have relatively high rates of completion, which could possibly be attributed to end of the holiday season, colder weather, and people mainly spending time at home.
It should be emphasized that this completion rate is between 81.5% and 90.2%, regardless of number of children and month.
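The two ends of that range can be reproduced from the counts in the first table. This is a minimal sketch with illustrative variable names, using only the January (0 children) and November (4 or More children) rows.

```r
# Counts copied from the count table for two of the sixty month/children cells.
completed <- c(jan_0 = 22529, nov_4_or_more = 676)
partial   <- c(jan_0 = 2457,  nov_4_or_more = 153)

# Completion rate = completed / (completed + partially completed).
completion_rate <- completed / (completed + partial)
round(100 * completion_rate, 1)
```

These two cells give the highest and lowest values in the Percent table, about 90.2% and 81.5%.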
Research question 2:
Is there a relationship between opinion of one's General Health and having been diagnosed with High Blood Cholesterol and a Heart Attack?
I first selected only the three columns I am observing, genhlth, toldhi2, and cvdinfr4. Here is the summary.
      genhlth         toldhi2         cvdinfr4
 Excellent: 85482   Yes :183501   Yes : 29284
 Very good:159076   No  :236612   No  :459904
 Good     :150555   NA's: 71662   NA's:  2587
 Fair     : 66726
 Poor     : 27951
 NA's     :  1985
The NAs come from Don’t know/Not Sure, Refused, and [Missing] responses: 1,985 in genhlth, 71,662 in toldhi2, and 2,587 in cvdinfr4. I removed these entries. toldhi2 and cvdinfr4 were combined into one column, Ch_HA, whose values give the cholesterol answer followed by the heart attack answer (for example, Yes_No means high cholesterol but no heart attack). The columns were renamed.
Health Ch_HA Count
1 Excellent No_No 51752
2 Excellent No_Yes 471
3 Excellent Yes_No 18439
4 Excellent Yes_Yes 589
5 Very good No_No 81654
6 Very good No_Yes 1478
7 Very good Yes_No 50502
8 Very good Yes_Yes 2590
9 Good No_No 63254
10 Good No_Yes 2629
11 Good Yes_No 55098
12 Good Yes_Yes 5985
13 Fair No_No 22152
14 Fair No_Yes 2222
15 Fair Yes_No 27334
16 Fair Yes_Yes 5705
17 Poor No_No 7889
18 Poor No_Yes 1444
19 Poor Yes_No 10770
20 Poor Yes_Yes 4579
I was not satisfied with the above plot, so I decided to explore the data again. I created a stacked bar chart of percentages.
We can clearly see that the better the interviewees’ opinion of their health, the higher the share diagnosed with neither high cholesterol nor a heart attack. Conversely, the worse their opinion of their health, the higher the share diagnosed with both.
Without doing an in-depth analysis, this graph is very telling. People seem to be somewhat aware of their own health.
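The shares behind a 100% stacked bar chart can be computed directly from the Health by Ch_HA counts. The sketch below copies those counts into a matrix and uses base R's `prop.table()`. The object names are illustrative.

```r
# Counts copied from the Health / Ch_HA table, one row per health rating.
health <- c("Excellent", "Very good", "Good", "Fair", "Poor")
counts <- matrix(
  c(51752,  471, 18439,  589,
    81654, 1478, 50502, 2590,
    63254, 2629, 55098, 5985,
    22152, 2222, 27334, 5705,
     7889, 1444, 10770, 4579),
  nrow = 5, byrow = TRUE,
  dimnames = list(health, c("No_No", "No_Yes", "Yes_No", "Yes_Yes"))
)

# Row-wise shares: each health rating sums to 100%.
shares <- 100 * prop.table(counts, margin = 1)
round(shares[, c("No_No", "Yes_Yes")], 1)
```

With these counts, the No_No share falls at every step from Excellent to Poor, while the Yes_Yes share rises at every step.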
Research question 3:
Is this relationship of Duration of Sleep and Diabetes consistent in the BRFSS dataset?
I first selected only the two columns I am observing, sleptim1 and diabete3. Here is the summary.
    sleptim1                                             diabete3
 Min.   :  0.000   Yes                                       : 62363
 1st Qu.:  6.000   Yes, but female told only during pregnancy:  4602
 Median :  7.000   No                                        :415374
 Mean   :  7.052   No, pre-diabetes or borderline diabetes   :  8604
 3rd Qu.:  8.000   NA's                                      :   832
 Max.   :450.000
 NA's   :7387
7,387 NAs in sleptim1 come from Don’t know/Not Sure and Refused, and 832 NAs in diabete3 come from Don’t know/Not Sure, Refused, and [Missing]. I removed these entries. There were also two interviews with impossible recorded entries for sleep, 103 and 450 hours. Both of these observations were eliminated. The columns were renamed.
Sleep Diabetes Count
1 1 Yes 49
2 1 Yes, but female told only during pregnancy 1
3 1 No 170
4 1 No, pre-diabetes or borderline diabetes 5
5 2 Yes 251
6 2 Yes, but female told only during pregnancy 14
7 2 No 766
8 2 No, pre-diabetes or borderline diabetes 34
9 3 Yes 749
10 3 Yes, but female told only during pregnancy 42
11 3 No 2588
12 3 No, pre-diabetes or borderline diabetes 102
13 4 Yes 2783
14 4 Yes, but female told only during pregnancy 154
15 4 No 10908
16 4 No, pre-diabetes or borderline diabetes 374
17 5 Yes 5272
18 5 Yes, but female told only during pregnancy 355
19 5 No 27045
20 5 No, pre-diabetes or borderline diabetes 686
21 6 Yes 13282
22 6 Yes, but female told only during pregnancy 1115
23 6 No 89692
24 6 No, pre-diabetes or borderline diabetes 1948
25 7 Yes 13574
26 7 Yes, but female told only during pregnancy 1322
27 7 No 125286
28 7 No, pre-diabetes or borderline diabetes 2114
29 8 Yes 17440
30 8 Yes, but female told only during pregnancy 1182
31 8 No 120013
32 8 No, pre-diabetes or borderline diabetes 2279
33 9 Yes 3379
34 9 Yes, but female told only during pregnancy 203
35 9 No 19754
36 9 No, pre-diabetes or borderline diabetes 433
37 10 Yes 2512
38 10 Yes, but female told only during pregnancy 88
39 10 No 9185
40 10 No, pre-diabetes or borderline diabetes 285
41 11 Yes 160
42 11 Yes, but female told only during pregnancy 12
43 11 No 634
44 11 No, pre-diabetes or borderline diabetes 25
45 12 Yes 887
46 12 Yes, but female told only during pregnancy 30
47 12 No 2671
48 12 No, pre-diabetes or borderline diabetes 80
49 13 Yes 33
50 13 Yes, but female told only during pregnancy 3
51 13 No 163
52 14 Yes 107
53 14 Yes, but female told only during pregnancy 6
54 14 No 321
55 14 No, pre-diabetes or borderline diabetes 10
56 15 Yes 113
57 15 Yes, but female told only during pregnancy 2
58 15 No 241
59 15 No, pre-diabetes or borderline diabetes 10
60 16 Yes 92
61 16 Yes, but female told only during pregnancy 2
62 16 No 266
63 16 No, pre-diabetes or borderline diabetes 8
64 17 Yes 5
65 17 Yes, but female told only during pregnancy 1
66 17 No 28
67 17 No, pre-diabetes or borderline diabetes 1
68 18 Yes 46
69 18 Yes, but female told only during pregnancy 2
70 18 No 111
71 18 No, pre-diabetes or borderline diabetes 5
72 19 Yes 4
73 19 No 9
74 20 Yes 23
75 20 Yes, but female told only during pregnancy 1
76 20 No 37
77 20 No, pre-diabetes or borderline diabetes 3
78 21 Yes 2
79 21 No 1
80 22 Yes 4
81 22 No 6
82 23 Yes 1
83 23 No 2
84 23 No, pre-diabetes or borderline diabetes 1
85 24 Yes 9
86 24 Yes, but female told only during pregnancy 1
87 24 No 24
88 24 No, pre-diabetes or borderline diabetes 1
Sleep Diabetes Count
1 5 or Less Yes 9104
2 5 or Less Yes, but female told only during pregnancy 566
3 5 or Less No 41477
4 5 or Less No, pre-diabetes or borderline diabetes 1201
5 6 Yes 13282
6 6 Yes, but female told only during pregnancy 1115
7 6 No 89692
8 6 No, pre-diabetes or borderline diabetes 1948
9 7 Yes 13574
10 7 Yes, but female told only during pregnancy 1322
11 7 No 125286
12 7 No, pre-diabetes or borderline diabetes 2114
13 8 Yes 17440
14 8 Yes, but female told only during pregnancy 1182
15 8 No 120013
16 8 No, pre-diabetes or borderline diabetes 2279
17 9 or More Yes 7377
18 9 or More Yes, but female told only during pregnancy 351
19 9 or More No 33453
20 9 or More No, pre-diabetes or borderline diabetes 862
Upon inspection of the summary, one can clearly see that the vast majority of interviewees reported 6-8 hours of sleep. Therefore, I grouped 1-5 hours as “5 or Less” and 9-24 hours as “9 or More”. The second output above is the modified summary.
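The cleaning and grouping steps described above can be sketched on a small vector. The hours below are hypothetical, and the nested `ifelse()` plus an ordered set of factor levels is one possible approach, not necessarily the one used for the tables.

```r
# Hypothetical reported hours of sleep, including two impossible values.
sleep_hours <- c(4, 7, 8, 103, 6, 12, 450, 5, 9)

# Drop values above 24 hours, which cannot describe a single day.
sleep_hours <- sleep_hours[sleep_hours <= 24]

# Collapse the sparse tails; compare numerically before converting to text.
sleep_group <- ifelse(sleep_hours <= 5, "5 or Less",
               ifelse(sleep_hours >= 9, "9 or More", as.character(sleep_hours)))

# Fixed factor levels keep the groups in a sensible order in tables and plots.
sleep_group <- factor(sleep_group,
                      levels = c("5 or Less", "6", "7", "8", "9 or More"))
table(sleep_group)
```

Setting the levels explicitly means the order does not depend on alphabetical sorting of the labels.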
I created two plots below, corresponding to the summaries. On the left is the unmodified data, by percent. On the right is the modified data with only five categories for hours of sleep, in absolute number of interviews.
From the graph on the left, one can see that the lowest rate of diabetes corresponds to those who reported 7 hours of sleep. Coincidentally, this is also the most commonly reported number of hours. 6 and 8 hours of sleep have the next-lowest rates of diabetes and are also the most common after 7 hours. One can see a sharp increase in the rate of diabetes for those who reported 5 hours or less. The trend for 9 or more hours follows no obvious pattern, but overall it seems that too much sleep/being sedentary is worse than too little sleep.
It is important to note that “9 or More” contains the smallest number of interviews spread across the widest range of hours of sleep.
Without doing any specific, in-depth analysis, the preliminary exploratory plots here, comparing sleep to diabetes, agree with the common saying of 7-8 hours of sleep a night for adults.
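To put numbers on the rates discussed above, the share of "Yes" answers in each sleep group can be computed from the grouped counts. This is a rough sketch: the counts are copied from the grouped summary, the shortened column labels are my own, and the rate counts only a plain "Yes" as diabetes.

```r
# Counts copied from the grouped Sleep / Diabetes summary.
sleep <- c("5 or Less", "6", "7", "8", "9 or More")
counts <- matrix(
  c( 9104,  566,  41477, 1201,
    13282, 1115,  89692, 1948,
    13574, 1322, 125286, 2114,
    17440, 1182, 120013, 2279,
     7377,  351,  33453,  862),
  nrow = 5, byrow = TRUE,
  dimnames = list(sleep, c("Yes", "Pregnancy only", "No", "Pre-diabetes"))
)

# Percent of interviews in each sleep group that answered "Yes".
diabetes_rate <- 100 * counts[, "Yes"] / rowSums(counts)
round(diabetes_rate, 1)
```

With these counts the 7-hour group has the lowest rate. The two tail groups come out close to each other, which is worth keeping in mind when comparing too little and too much sleep.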
All code for each research question is included here.
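#Packages used by the code below (assumed here; the setup chunk is not shown in this section).
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(viridis)
library(cowplot)
library(gridExtra)
#g_legend() is not defined in this code; lemon::g_legend() is one available implementation.
#Select only "Interview Month", "Final Disposition", and "Number Of Children In Household".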
month_Code_Children <- brfss2013 %>%
select(imonth, dispcode, children)
#Initial Summary.
month_Code_Children %>%
summary()
#Currently 491,775 observations.
#3 NAs are from imonth.
#5 NAs are from dispcode.
#2,274 NAs are from Refused, and [Missing] from children
#Remove NAs.
month_Code_Children <- na.omit(month_Code_Children)
#Combine 4 or more children into one value, comparing numerically before converting to text.
month_Code_Children$children <- as.numeric(month_Code_Children$children)
month_Code_Children$children <- ifelse(month_Code_Children$children > 3,
"4 or More",
as.character(month_Code_Children$children))
#Get a count of each completion level for each children level for each month level.
month_Code_Children <- data.frame(month_Code_Children %>%
group_by(imonth, children, dispcode) %>%
tally())
#Change the column names.
colnames(month_Code_Children) <- c("Month", "Children", "Disposition", "Count")
#Count grouped by Month, Children, and Disposition Summary.
month_Code_Children
Plot : Number of Household Interviews by Children and Month
#Helper Function to create a two line plot of completed and partially completed interviews based on number
#of children entered.
plot_Q1 <- function(df, children, boolean) {
g <- ggplot(df %>% filter(Children == children), aes(x = Month, y = Count, color = Disposition))
g <- g + geom_point(size = 5) + geom_line(aes(group = Disposition), size = 1.2)
#Viridis Color Scheme.
g <- g + scale_color_viridis_d()
#Boolean is FALSE for the first 4 graphs. No X-axis labels and no legend. The last graph defines those.
if (boolean == FALSE) {
#Y-axis.
g <- g + scale_y_continuous(name = paste(children, "Children"),
labels = comma)
#Modify labels and text.
g <- g + theme(axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_text(face = "bold"),
legend.position = "none")
} else {
#Y-axis.
g <- g + scale_y_continuous(name = children,
labels = comma)
#Modify labels and text.
g <- g + theme(axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_blank(),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
legend.key = element_blank(),
legend.position = "bottom")
}
#Return the finished plot explicitly.
g
}
#Main Title.
title <- ggdraw() +
draw_label("Number of Household Interviews by Children and Month",
fontface = 'bold', hjust = 0.45, size = 16)
#Alignment of 5 plots.
graphs <- plot_grid(plot_Q1(month_Code_Children, 0, FALSE),
plot_Q1(month_Code_Children, 1, FALSE),
plot_Q1(month_Code_Children, 2, FALSE),
plot_Q1(month_Code_Children, 3, FALSE),
plot_Q1(month_Code_Children, "4 or More", TRUE),
align = "v", nrow = 5,
rel_heights = c(2/13, 2/13, 2/13, 2/13, 5/13))
#Add Title.
plot_grid(title, graphs, ncol = 1, rel_heights = c(.05, .95))
#Spread Disposition into two new columns, values are the Counts.
month_Code_Children <- month_Code_Children %>%
spread(key = Disposition, value = Count)
#Two new columns are Yes (Completed) and No (Partially Completed)
colnames(month_Code_Children) <- c("Month", "Children", "Yes", "No")
#Create a new column Percent from Yes and No.
month_Code_Children <- month_Code_Children %>%
mutate(Percent = Yes / (Yes + No)) %>%
select(1,2,5)
#% Completion grouped by Month and Children Summary.
month_Code_Children
Plot : % of Completed Interviews by Children and Month
#Scatter and Line Plot.
g1 <- ggplot(month_Code_Children, aes(x = Month, y = Percent, color = Children))
g1 <- g1 + geom_point(size = 6) + geom_line(aes(group = Children), size = 1.3)
#Title.
g1 <- g1 + ggtitle("% of Completed Interviews by Children and Month")
#Viridis Color Scheme.
g1 <- g1 + scale_color_viridis_d()
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Completed Interviews",
labels = percent)
#Modify labels and text.
g1 <- g1 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_blank(),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.title = element_text(face = "bold"),
legend.key = element_blank(),
legend.position = "bottom")
g1
#Select only "General Health", "Ever Told Blood Cholesterol High", and "Ever Diagnosed With Heart Attack".
health_Ch_HA <- brfss2013 %>%
select(genhlth, toldhi2, cvdinfr4)
#Initial Summary.
health_Ch_HA %>%
summary()
#Currently 491,775 observations.
#1,985 NAs are from genhlth.
#71,662 NAs are from toldhi2.
#2,587 NAs are from cvdinfr4.
#Remove NAs.
health_Ch_HA <- na.omit(health_Ch_HA)
#Combine toldhi2 and cvdinfr4 into one variable.
health_Ch_HA$Ch_HA <- paste(health_Ch_HA$toldhi2, health_Ch_HA$cvdinfr4, sep = "_")
#Get a count of each health level for each cholesterol/heart attack level.
health_Ch_HA <- data.frame(health_Ch_HA %>%
select(1, 4) %>%
group_by(genhlth, Ch_HA) %>%
tally())
#Change the column names.
colnames(health_Ch_HA) <- c("Health", "Ch_HA", "Count")
#Count grouped by Health and Cholesterol/Heart Attack.
health_Ch_HA
Plot : Cholesterol and Heart Attack Diagnoses vs Opinion of Health
#Scatter and Line Plot.
g3 <- ggplot(health_Ch_HA, aes(x = Health, y = Count, color = Ch_HA, shape = Ch_HA))
g3 <- g3 + geom_point(size = 6)
g3 <- g3 + geom_line(aes(group = Ch_HA), size = 1.3)
#Title.
g3 <- g3 + ggtitle("Cholesterol and Heart Attack\nDiagnoses vs Opinion of Health")
#X-axis.
g3 <- g3 + scale_x_discrete(name = "Interviewee's Opinion of Their Own Health")
#Y-axis.
g3 <- g3 + scale_y_continuous(name = "Number of Interviews",
labels = comma)
#Manual color scheme from two viridis colors: color encodes the cholesterol answer.
g3 <- g3 + scale_color_manual(name = "Cholesterol and Heart\nAttack Diagnoses",
labels = c("No to Both",
"Has had a Heart Attack,\nbut not High Cholesterol",
"Has had High Cholesterol,\nbut not a Heart Attack",
"Yes to Both"),
values = c(viridis(4)[2], viridis(4)[2],viridis(4)[3], viridis(4)[3]))
#Shape scheme: shape encodes the heart attack answer.
g3 <- g3 + scale_shape_manual(name = "Cholesterol and Heart\nAttack Diagnoses",
labels = c("No to Both",
"Has had a Heart Attack,\nbut not High Cholesterol",
"Has had High Cholesterol,\nbut not a Heart Attack",
"Yes to Both"),
values = c(16, 17, 16, 17))
#Modify labels and text.
g3 <- g3 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.title = element_text(hjust = 0.5, face = "bold"),
legend.key = element_blank(),
legend.key.size = unit(1, "cm"))
g3
Plot : Cholesterol and Heart Attack Diagnoses vs Opinion of Health
#Stacked Barchart, total = 100%.
g4 <- ggplot(health_Ch_HA, aes(x = Health, y = Count,fill = Ch_HA))
g4 <- g4 + geom_bar(position = "fill", stat = "identity")
#Title
g4 <- g4 + ggtitle(label = "Cholesterol and Heart Attack\nDiagnoses vs Opinion of Health")
#X-axis.
g4 <- g4 + scale_x_discrete(name = "Interviewee's Opinion of Their Own Health",
expand = c(0, 0))
#Y-axis.
g4 <- g4 + scale_y_continuous(name = "Percent of Interviews",
labels = percent,
expand = c(0, 0))
#Viridis Fill Scheme, for two variables.
g4 <- g4 + scale_fill_viridis_d(name = "Cholesterol and Heart\n Attack Diagnoses",
labels = c("No to Both",
"Has had a Heart Attack,\nbut not High Cholesterol",
"Has had High Cholesterol,\nbut not a Heart Attack",
"Yes to Both"))
#Modify labels and text.
g4 <- g4 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.title = element_text(hjust = 0.5, face = "bold"),
legend.key.size = unit(1, "cm"))
g4
#Select only "How Much Time Do You Sleep" and "(Ever Told) You Have Diabetes"
sleep_Diabetes <- brfss2013 %>%
select(sleptim1, diabete3)
#Initial Summary.
sleep_Diabetes %>%
summary()
#Currently 491,775 observations.
#7,387 NAs are from Don't know/Not Sure and Refused from sleptim1.
#832 NAs are from Don't know/Not Sure, Refused, and [Missing] from diabete3.
#Remove NAs.
sleep_Diabetes <- na.omit(sleep_Diabetes)
#Remove two observations, 103 and 450 hours of sleep.
sleep_Diabetes <- sleep_Diabetes %>%
filter(sleptim1 <= 24)
#Get a count of each diabetes level for each sleep level.
sleep_Diabetes1 <- data.frame(sleep_Diabetes %>%
group_by(sleptim1, diabete3) %>%
tally())
#Change the column names.
colnames(sleep_Diabetes1) <- c("Sleep", "Diabetes", "Count")
#Count grouped by Sleep and Diabetes Summary.
sleep_Diabetes1
#Combine 1-5 hours into "5 or Less" and 9-24 hours into "9 or More",
#comparing numerically before converting to text.
sleep_Diabetes$sleptim1 <- as.numeric(sleep_Diabetes$sleptim1)
sleep_Diabetes$sleptim1 <- ifelse(sleep_Diabetes$sleptim1 > 8, "9 or More",
ifelse(sleep_Diabetes$sleptim1 < 6, "5 or Less",
as.character(sleep_Diabetes$sleptim1)))
#Get a count of each diabetes level for each sleep level.
sleep_Diabetes2 <- data.frame(sleep_Diabetes %>%
group_by(sleptim1, diabete3) %>%
tally())
#Change the column names.
colnames(sleep_Diabetes2) <- c("Sleep", "Diabetes", "Count")
#Count grouped by Sleep and Diabetes Summary.
sleep_Diabetes2
Plot : Comparing Quantity of Sleep and Diabetes Diagnosis
#Stacked Barchart, total = 100%.
g5 <- ggplot(sleep_Diabetes1, aes(x = Sleep, y = Count, fill = Diabetes))
g5 <- g5 + geom_bar(position = "fill", stat = "identity")
#No title, but leave space at top.
g5 <- g5 + ggtitle(label = "")
#X-axis.
g5 <- g5 + scale_x_continuous(name = "",
breaks = c(6, 12, 18, 24),
labels = c("6", "12", "18", "24"),
expand = c(0, 0))
#Y-axis.
g5 <- g5 + scale_y_continuous(name = "Percent of Interviews",
labels = percent,
expand = c(0, 0))
#Viridis Fill Scheme.
g5 <- g5 + scale_fill_viridis_d()
#Modify labels and text. Remove legend from left plot.
g5 <- g5 + theme(plot.title = element_text(hjust = 1, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
#Stacked Barchart, total = absolute count.
g6 <- ggplot(sleep_Diabetes2, aes(x = Sleep, y = Count, fill = Diabetes))
g6 <- g6 + geom_bar(position = "stack", stat = "identity")
#Shared Title.
g6 <- g6 + ggtitle(label = "Comparing Quantity of Sleep and Diabetes Diagnosis")
#X-axis.
g6 <- g6 + scale_x_discrete(name = "Reported Hours of Sleep Each Night",
expand = c(0, 0))
#Y-axis.
g6 <- g6 + scale_y_continuous(name = "Number of Interviews",
position = "right",
labels = comma,
expand = c(0, 0))
#Viridis Fill Scheme.
g6 <- g6 + scale_fill_viridis_d()
#Modify labels and text.
g6 <- g6 + theme(plot.title = element_text(hjust = 1.1, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(hjust = 3, size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Modify legend orientation.
g6 <- g6 + guides(fill = guide_legend(title = "Diabetes Diagnosis",
title.position = "left",
nrow = 2, byrow = TRUE))
#Modify legend text. Save as object.
legend3 <- g_legend(g6 + theme(legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 10),
legend.position = "right"))
#Remove legend from right plot.
g6 <- g6 + theme(legend.position = "none")
#Arrange left plot, right plot, and legend on the bottom.
grid.arrange(g5, g6, legend3, layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2),
c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2),
c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2),
c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2),
c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2),
c(NA, NA, NA, NA, NA, 3, 3, 3, NA, NA, NA, NA, NA)),
ncol = 13))