The data collected by General Social Survey (GSS) Cumulative File 1972-2012 seems to be a randomly selected sample of the US population. Between 1972 - 2012, the population of the US was approximately 209 - 314 million.
I looked at several locations, but only found details about the sampling on Wikipedia. The survey is voluntary and took about 90 minutes, so I do have reservations. Also, the absolute and relative number of surveys gathered each year seems to vary.
This was an observational study, as no hypothesis, controls, nor confounding variables were specified beforehand. We can use this data to generalize and draw some trends in the United States, but not make causal arguments.
Looking at surveys from only females, I wanted to see if any trends emerge over time of the number of children and highest level education completed.
Does the average number of children vary by highest education level completed?
I first removed the NAs, filtered for only females, and selected the three columns I am observing, year, degree, and childs. Here is the summary.
year degree childs
Min. :1972 Lt High School: 6656 Min. :0.000
1st Qu.:1983 High School :16910 1st Qu.:1.000
Median :1993 Junior College: 1790 Median :2.000
Mean :1992 Bachelor : 4168 Mean :2.064
3rd Qu.:2002 Graduate : 1776 3rd Qu.:3.000
Max. :2012 Max. :8.000
From the plot below, we can see that a disproportionate number of surveys were filled out by those whose highest level of education was High School. For some years it is a majority of females. Whether or not this is a reflection of the true US population is unknown.
From the plot below we can clearly see that females who did not finish high school consistently have the highest average number of children. We can see somewhat of a trend that as the education level increases, the average number of children decreases.
The boxplot below agrees with the statement from above. Females who have not completed high school have a median of 3 children. Those who have completed high school or some college have a median of 2 children. Females who have completed their Bachelor or graduate degrees have a median of 1 children.
The summary statistics of the data are show below. The mean number of children decreases as education level increases.
Degree Count Mean Variance Skewness Kurtosis
1 Lt High School 6656 2.882512 4.713467 0.6998711 2.858344
2 High School 16910 2.006860 2.661494 0.8754421 4.007055
3 Junior College 1790 1.788268 2.213948 0.8523598 4.105106
4 Bachelor 4168 1.399712 2.062894 1.0052714 4.273650
5 Graduate 1776 1.378941 1.847872 0.9349814 4.091965
6 Overall 31300 2.064089 3.180020 0.9999736 4.092638
I. Independence.
The observations are independent based on the GSS sampling method. There is no evidence of any paired data.
II. Approximate Normality
From the box plot and histograms, it seems that for each level of education completed, the spread of average number of children is skewed right, which is what one would expect for number of children in an industrialized country like the United States. To get a more quantitive measure, I calculated the skewness and kurtosis for each group.
The skewness is 0.7 - 1.0 for each group, which follows that they are neither symmetric nor highly skewed, but moderately skewed.
The kurtosis is between 2.9 to 4.3 for each group, indicating leptokurtic data, since the 3 has been subtracted from the value. It is, however, not extreme.
III. Equal Variance
The variances range from 1.85 to 4.71 for the level of education completed. This is not equal, yet not extreme. I have spent a lot of time reading about the importance of this. Because my sample sizes are large, it is perhaps not critically important. Because they are not quite equal, there may be a higher likelihood of Type I Errors.
H0 : The average number of children for females is the same across all education levels.
µLt High School = µHigh School = µJr College = µBachelor = µGraduate
HA : The average number of children for females differs for at least one pair of education levels.
I am testing for 95% significance so α = 0.05.
Df Sum Sq Mean Sq F value Pr(>F)
degree 4 7323 1830.8 621.4 <0.0000000000000002 ***
Residuals 31295 92208 2.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is extremely small and less than α, we reject H0. However, we do not yet know which specific pair is different.
Degrees Center Lower_Limit Upper_Limit Adjusted_p_Value Significance
1 High School-Lt High School -0.8757 -0.9434 -0.8079 0.00000000 Reject H0
2 Junior College-Lt High School -1.0942 -1.2189 -0.9696 0.00000000 Reject H0
3 Bachelor-Lt High School -1.4828 -1.5753 -1.3903 0.00000000 Reject H0
4 Graduate-Lt High School -1.5036 -1.6286 -1.3785 0.00000000 Reject H0
5 Junior College-High School -0.2186 -0.3350 -0.1022 0.00000298 Reject H0
6 Bachelor-High School -0.6071 -0.6881 -0.5262 0.00000000 Reject H0
7 Graduate-High School -0.6279 -0.7447 -0.5111 0.00000000 Reject H0
8 Bachelor-Junior College -0.3886 -0.5209 -0.2562 0.00000000 Reject H0
9 Graduate-Junior College -0.4093 -0.5661 -0.2525 0.00000000 Reject H0
10 Graduate-Bachelor -0.0208 -0.1535 0.1119 0.99307736 Fail to Reject H0
The p-values are extremely small and less than α for all pairs except Bachelor-Graduate.
The largest difference of means occurs between Lt High School and, in order, Graduate, Bachelor, Jr. College, and High School. This agrees with what was displayed in the scatter plot in the Exploratory data Analysis.
The data does not strictly fit the assumptions for ANOVA. However, the samples are large, and the p-Values are extremely small.
Therefore, we are 95% confident that the difference in the mean number of children by females of different education levels, except for Bachelor and Graduate, is not 0.
Further studies would be necessary to establish causality.
All code for each Part is included here.
knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)
#List for X-axis ticks.
years <- c(1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015)
#Count the number of surveys per year.
year_count <- data.frame(gss %>% group_by(year) %>% tally())
#Load US Population data.
us_pop <- read.csv("https://query.data.world/s/gomwflrhpj5ckb66wasrfyo62nhd3t")
#Merge the two above dataframes, create a percent column.
samples <- merge(x = year_count, y = us_pop, by.x = "year", by.y = "year") %>%
mutate(p = n/population) %>%
`colnames<-`(c("Year", "Count", "Population", "Percent"))
#Scatter, Year by Count.
g1 <- ggplot(samples, aes(x = Year, y = Count))
g1 <- g1 + geom_point(size = 5, color = viridis(4)[2])
#Title.
g1 <- g1 + ggtitle(label = "")
#X-axis.
g1 <- g1 + scale_x_continuous(name = "",
breaks = years,
labels = years)
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Number of Surveys Gathered",
labels = comma)
#Modify labels and text. Remove legend from left plot.
g1 <- g1 + theme(plot.title = element_text(hjust = 1, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Scatter, Year by Count % of US Population.
g2 <- ggplot(samples, aes(x = Year, y = Percent))
g2 <- g2 + geom_point(size = 5, color = viridis(4)[3])
#Title.
g2 <- g2 + ggtitle(label = "Survey Sampling by Year")
#X-axis.
g2 <- g2 + scale_x_continuous(name = "GSS Survey Year",
breaks = years,
labels = years)
#Y-axis.
g2 <- g2 + scale_y_continuous(name = "US Population Surveyed",
position = "right",
labels = percent)
#Modify labels and text. Remove legend from left plot.
g2 <- g2 + theme(plot.title = element_text(hjust = -2.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(hjust = -0.7, size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g1, g2, layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2)),
ncol = 13))
#Remove NAs, filter for only females, and select only "GSS Year", "RS highest degree", and "Number of children".
year_deg_child <- na.omit(gss %>%
filter(sex == "Female") %>%
select(year, degree, childs))
#Initial Summary.
summary(year_deg_child)
Plot : Count of Education Level by Year (Female Surveys Only)
#Group by year and degree, and count the total number of surveys.
year_deg <- data.frame(year_deg_child %>%
group_by(year, degree) %>%
tally()) %>%
`colnames<-`(c("Year", "Degree", "Count"))
#Scatter and Line Plot.
g3 <- ggplot(year_deg, aes(x = Year, y = Count, color = Degree))
g3 <- g3 + geom_point(size = 6) + geom_line(aes(group = Degree), size = 1.3)
#Title.
g3 <- g3 + ggtitle("Count of Education Level by Year (Female Surveys Only)")
#Viridis Color Scheme.
g3 <- g3 + scale_color_viridis_d()
#X-axis.
g3 <- g3 + scale_x_continuous(name = "GSS Survey Year",
breaks = years,
labels = years)
#Y-axis.
g3 <- g3 + scale_y_continuous(name = "Number of Surveys Gathered")
#Modify labels and text.
g3 <- g3 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_blank(),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.title = element_text(face = "bold"),
legend.key = element_blank(),
legend.position = "bottom")
g3
Plot : Average Number of Children by Degree and Year (Female Surveys Only)
#Calculate the mean number of children by year and degree.
year_deg_child_mean <- aggregate(year_deg_child$childs,
list(year_deg_child$year, year_deg_child$degree),
mean, na.rm = TRUE) %>%
`colnames<-`(c("Year", "Degree", "Children"))
#Scatter and Line Plot.
g4 <- ggplot(year_deg_child_mean, aes(x = Year, y = Children, color = Degree))
g4 <- g4 + geom_point(size = 6) + geom_line(aes(group = Degree), size = 1.3)
#Title.
g4 <- g4 + ggtitle("Average Number of Children by Degree and Year (Female Surveys Only)")
#Viridis Color Scheme.
g4 <- g4 + scale_color_viridis_d()
#X-axis.
g4 <- g4 + scale_x_continuous(name = "GSS Survey Year",
breaks = years,
labels = years)
#Y-axis.
g4 <- g4 + scale_y_continuous(name = "Number of Children")
#Modify labels and text.
g4 <- g4 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_blank(),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.title = element_text(face = "bold"),
legend.key = element_blank(),
legend.position = "bottom")
g4
Plot: Spread of Children by Education Level, 1972 - 2012 (Female Surveys Only)
#Spread of children by degree.
deg_child <- year_deg_child %>%
select(degree, childs) %>%
`colnames<-`(c("Degree", "Children"))
#BoxPlot.
g5 <- ggplot(deg_child, aes(x = Degree, y = Children, color = Degree))
g5 <- g5 + geom_boxplot(size = 1.3)
#Title.
g5 <- g5 + ggtitle("Spread of Children by Education Level, 1972 - 2012 (Female Surveys Only)")
#Viridis Color Scheme.
g5 <- g5 + scale_color_viridis_d()
#X-axis.
g5 <- g5 + scale_x_discrete(name = "Education Level")
#Y-axis.
g5 <- g5 + scale_y_continuous(name = "Number of Children")
#Modify labels and text.
g5 <- g5 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_blank(),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.title = element_text(face = "bold"),
legend.position = "none")
g5
Plot : Spread of Children by Education Level, 1972 - 2012 (Female Surveys Only)
#Distribution of children by degree.
g6 <- ggplot(deg_child, aes(x = Children, fill = Degree))
g6 <- g6 + geom_bar(stat = "count")
g6 <- g6 + facet_wrap(~Degree, scales = "free")
#Title.
g6 <- g6 + ggtitle("Spread of Children by Education Level, 1972 - 2012 (Female Surveys Only)")
#Viridis Color Scheme.
g6 <- g6 + scale_fill_viridis_d()
#X-axis
g6 <- g6 + scale_x_continuous("Number of Children")
#Y-axis.
g6 <- g6 + scale_y_continuous(name = "Frequency")
#Modify labels and text.
g6 <- g6 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.title = element_text(face = "bold"),
legend.position = "none")
g6
Summary Statistics.
#For each degree, determine the number of surveys completed, as well as the mean, variance,
#skewness, and kurtosis of the number of children.
deg_child_stats <- do.call(data.frame,
aggregate(year_deg_child$childs,
list(year_deg_child$degree),
FUN = function(x) c(n = length(x),
m = mean(x),
var = var(x),
sk = skewness(x),
kt = kurtosis(x)))) %>%
`colnames<-`(c("Degree", "Count", "Mean", "Variance", "Skewness", "Kurtosis"))
#Bind the same stats for all the data to that which is grouped by degree.
deg_child_stats <- rbind(deg_child_stats,
data.frame("Degree" = "Overall",
"Count" = nrow(year_deg_child),
"Mean" = mean(year_deg_child$childs),
"Variance" = var(year_deg_child$childs),
"Skewness" = skewness(year_deg_child$childs),
"Kurtosis" = kurtosis(year_deg_child$childs)))
#Summary of calculations of children by degree.
deg_child_stats
ANOVA :
#ANOVA on children by degree. 5 groups, 10 pairs.
anova1 <- aov(childs ~ degree, data = year_deg_child)
#ANOVA summary.
summary(anova1)
Statistics : 95% Family-Wise Confidence Level.
#Pairwise comparisons.
degree_comparisons <- data.frame((TukeyHSD(anova1))[1]) %>%
`colnames<-`(c("Center", "Lower_Limit", "Upper_Limit", "Adjusted_p_Value"))
degree_comparisons$Degrees <- rownames(degree_comparisons)
rownames(degree_comparisons) <- NULL
#Round values.
degree_comparisons<- degree_comparisons %>%
mutate(Degrees = as.factor(Degrees),
Center = round(Center, 4),
Lower_Limit = round(Lower_Limit, 4),
Upper_Limit = round(Upper_Limit, 4),
Adjusted_p_Value = round(Adjusted_p_Value, 8),
Significance = ifelse(Adjusted_p_Value > .9, "Fail to Reject H0", "Reject H0")) %>%
select(5, 1, 2, 3, 4, 6)
#Define factor level to preserve order.
degree_comparisons$Degrees <- factor(degree_comparisons$Degrees,
levels = degree_comparisons$Degrees[order(-as.integer(rownames(degree_comparisons)))])
#Summary of 10 pairs.
print(degree_comparisons, width = 100)
Plot: 95% Family-Wise Confidence Level.
#Confidence Level Plot.
g7 <- ggplot(degree_comparisons, aes(y = Degrees))
#Vertical line at x = 0, no difference in means.
g7 <- g7 + geom_vline(xintercept = 0,
linetype = "dashed",
color = viridis(6)[1],
size = 1)
#Lower Limit, Center, and Upper Limit for each pair.
g7 <- g7 + geom_point(aes(x = Lower_Limit, color = Significance), size = 5, pch = "|")
g7 <- g7 + geom_point(aes(x = Center, color = Significance), size = 3, pch = "|")
g7 <- g7 + geom_point(aes(x = Upper_Limit, color = Significance), size = 5, pch = "|")
#Connect Lower Limit and Upper Limit for each pair.
g7 <- g7 + geom_segment(aes(x = Lower_Limit,
y = Degrees,
xend = Upper_Limit,
yend = Degrees,
color = Significance),
size = 1.25)
#Title.
g7 <- g7 + ggtitle("95% Family-Wise Confidence Level")
#X-axis.
g7 <- g7 + xlab("Difference in Mean Number of Children by Degree (Female Surveys Only)")
#Y-axis.
g7 <- g7 + ylab("Degree Comparisons")
#Viridis Color Scale.
g7 <- g7 + scale_color_manual(name = "Significance",
labels = c("Fail to Reject H0", "Reject H0"),
values = c(viridis(5)[3], viridis(5)[4]))
g7 <- g7 + guides(color = guide_legend(reverse=TRUE))
#Viridis for Y-axis ticks.
fontcolor <- rev(ifelse(degree_comparisons$Significance == "Fail to Reject H0", viridis(5)[3], viridis(5)[4]))
#Modify labels and text.
g7 <- g7 + theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11, angle = 25, color = fontcolor),
axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"),
legend.title = element_text(face = "bold"),
legend.key = element_blank(),
legend.position = "bottom")
g7