Part 1: Data

The data collected by General Social Survey (GSS) Cumulative File 1972-2012 seems to be a randomly selected sample of the US population. Between 1972 - 2012, the population of the US was approximately 209 - 314 million.

I looked at several locations, but only found details about the sampling on Wikipedia. The survey is voluntary and took about 90 minutes, so I do have reservations. Also, the absolute and relative number of surveys gathered each year seems to vary.

This was an observational study, as no hypothesis, controls, nor confounding variables were specified beforehand. We can use this data to generalize and draw some trends in the United States, but not make causal arguments.

Part 2: Research Question

Looking at surveys from only females, I wanted to see if any trends emerge over time of the number of children and highest level education completed.

Does the average number of children vary by highest education level completed?

Part 3: Exploratory Data Analysis

I first removed the NAs, filtered for only females, and selected the three columns I am observing, year, degree, and childs. Here is the summary.

      year                 degree          childs     
 Min.   :1972   Lt High School: 6656   Min.   :0.000  
 1st Qu.:1983   High School   :16910   1st Qu.:1.000  
 Median :1993   Junior College: 1790   Median :2.000  
 Mean   :1992   Bachelor      : 4168   Mean   :2.064  
 3rd Qu.:2002   Graduate      : 1776   3rd Qu.:3.000  
 Max.   :2012                          Max.   :8.000

From the plot below, we can see that a disproportionate number of surveys were filled out by those whose highest level of education was High School. For some years it is a majority of females. Whether or not this is a reflection of the true US population is unknown.

From the plot below we can clearly see that females who did not finish high school consistently have the highest average number of children. We can see somewhat of a trend that as the education level increases, the average number of children decreases.

The boxplot below agrees with the statement from above. Females who have not completed high school have a median of 3 children. Those who have completed high school or some college have a median of 2 children. Females who have completed their Bachelor or graduate degrees have a median of 1 children.

The summary statistics of the data are show below. The mean number of children decreases as education level increases.

          Degree Count     Mean Variance  Skewness Kurtosis
1 Lt High School  6656 2.882512 4.713467 0.6998711 2.858344
2    High School 16910 2.006860 2.661494 0.8754421 4.007055
3 Junior College  1790 1.788268 2.213948 0.8523598 4.105106
4       Bachelor  4168 1.399712 2.062894 1.0052714 4.273650
5       Graduate  1776 1.378941 1.847872 0.9349814 4.091965
6        Overall 31300 2.064089 3.180020 0.9999736 4.092638

Part 4: Inference

Conditions for ANOVA:

I. Independence.

The observations are independent based on the GSS sampling method. There is no evidence of any paired data.

II. Approximate Normality

From the box plot and histograms, it seems that for each level of education completed, the spread of average number of children is skewed right, which is what one would expect for number of children in an industrialized country like the United States. To get a more quantitive measure, I calculated the skewness and kurtosis for each group.

The skewness is 0.7 - 1.0 for each group, which follows that they are neither symmetric nor highly skewed, but moderately skewed.

The kurtosis is between 2.9 to 4.3 for each group, indicating leptokurtic data, since the 3 has been subtracted from the value. It is, however, not extreme.

III. Equal Variance

The variances range from 1.85 to 4.71 for the level of education completed. This is not equal, yet not extreme. I have spent a lot of time reading about the importance of this. Because my sample sizes are large, it is perhaps not critically important. Because they are not quite equal, there may be a higher likelihood of Type I Errors.

Hypotheses:

H₀ : The average number of children for females is the same across all education levels.

µ_{Lt High School} = µ_{High School} = µ_{Jr College} = µ_Bachelor = µ_Graduate

H_A : The average number of children for females differs for at least one pair of education levels.

I am testing for 95% significance so α = 0.05.

               Df Sum Sq Mean Sq F value              Pr(>F)    
degree          4   7323  1830.8   621.4 <0.0000000000000002 ***
Residuals   31295  92208     2.9                                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is extremely small and less than α, we reject H₀. However, we do not yet know which specific pair is different.

                         Degrees  Center Lower_Limit Upper_Limit Adjusted_p_Value      Significance
1     High School-Lt High School -0.8757     -0.9434     -0.8079       0.00000000         Reject H0
2  Junior College-Lt High School -1.0942     -1.2189     -0.9696       0.00000000         Reject H0
3        Bachelor-Lt High School -1.4828     -1.5753     -1.3903       0.00000000         Reject H0
4        Graduate-Lt High School -1.5036     -1.6286     -1.3785       0.00000000         Reject H0
5     Junior College-High School -0.2186     -0.3350     -0.1022       0.00000298         Reject H0
6           Bachelor-High School -0.6071     -0.6881     -0.5262       0.00000000         Reject H0
7           Graduate-High School -0.6279     -0.7447     -0.5111       0.00000000         Reject H0
8        Bachelor-Junior College -0.3886     -0.5209     -0.2562       0.00000000         Reject H0
9        Graduate-Junior College -0.4093     -0.5661     -0.2525       0.00000000         Reject H0
10             Graduate-Bachelor -0.0208     -0.1535      0.1119       0.99307736 Fail to Reject H0

The p-values are extremely small and less than α for all pairs except Bachelor-Graduate.

Conclusion:

The largest difference of means occurs between Lt High School and, in order, Graduate, Bachelor, Jr. College, and High School. This agrees with what was displayed in the scatter plot in the Exploratory data Analysis.

The data does not strictly fit the assumptions for ANOVA. However, the samples are large, and the p-Values are extremely small.

Therefore, we are 95% confident that the difference in the mean number of children by females of different education levels, except for Bachelor and Graduate, is not 0.

Further studies would be necessary to establish causality.

Appendix

All code for each Part is included here.

Setup

knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)

#List for X-axis ticks.
years <- c(1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015)

Load Packages.

library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(moments)
library(scales)
library(statsr)
library(viridis)

Load Data.

load("gss.Rdata")

Part 1: Data

#Count the number of surveys per year.
year_count <- data.frame(gss %>% group_by(year) %>% tally())
#Load US Population data.
us_pop <- read.csv("https://query.data.world/s/gomwflrhpj5ckb66wasrfyo62nhd3t")
#Merge the two above dataframes, create a percent column.
samples <- merge(x = year_count, y = us_pop, by.x = "year", by.y = "year") %>%
    mutate(p = n/population) %>%
    `colnames<-`(c("Year", "Count", "Population", "Percent"))

#Scatter, Year by Count.
g1 <- ggplot(samples, aes(x = Year, y = Count))
g1 <- g1 + geom_point(size = 5, color = viridis(4)[2])
#Title.
g1 <- g1 + ggtitle(label = "")
#X-axis.
g1 <- g1 + scale_x_continuous(name = "",
                              breaks = years,
                              labels = years)
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Number of Surveys Gathered",
                              labels = comma)
#Modify labels and text. Remove legend from left plot.
g1 <- g1 + theme(plot.title = element_text(hjust = 1, size = 16, face = "bold"),
                 axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                 axis.title.x = element_text(size = 14, face = "bold"),
                 axis.text.y = element_text(size = 12),
                 axis.title.y = element_text(size = 14, face = "bold"))

#Scatter, Year by Count % of US Population.
g2 <- ggplot(samples, aes(x = Year, y = Percent))
g2 <- g2 + geom_point(size = 5, color = viridis(4)[3])
#Title.
g2 <- g2 + ggtitle(label = "Survey Sampling by Year")
#X-axis.
g2 <- g2 + scale_x_continuous(name = "GSS Survey Year",
                              breaks = years,
                              labels = years)
#Y-axis.
g2 <- g2 + scale_y_continuous(name = "US Population Surveyed",
                              position = "right",
                              labels = percent)
#Modify labels and text. Remove legend from left plot.
g2 <- g2 + theme(plot.title = element_text(hjust = -2.5, size = 16, face = "bold"),
                 axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                 axis.title.x = element_text(hjust = -0.7, size = 14, face = "bold"),
                 axis.text.y = element_text(size = 12),
                 axis.title.y = element_text(size = 14, face = "bold"))

grid.arrange(g1, g2, layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2)),
                                            ncol = 13))

Part 3: Exploratory Data Analysis

#Remove NAs, filter for only females, and select only "GSS Year", "RS highest degree", and "Number of children".
year_deg_child <- na.omit(gss %>%
                            filter(sex == "Female") %>%
                            select(year, degree, childs))
#Initial Summary.
summary(year_deg_child)

Plot : Count of Education Level by Year (Female Surveys Only)

#Group by year and degree, and count the total number of surveys.
year_deg <- data.frame(year_deg_child %>%
                         group_by(year, degree) %>%
                         tally()) %>%
  `colnames<-`(c("Year", "Degree", "Count"))


#Scatter and Line Plot.
g3 <- ggplot(year_deg, aes(x = Year, y = Count, color = Degree))
g3 <- g3 + geom_point(size = 6) + geom_line(aes(group = Degree), size = 1.3)
#Title.
g3 <- g3 + ggtitle("Count of Education Level by Year (Female Surveys Only)")
#Viridis Color Scheme.
g3 <- g3 + scale_color_viridis_d()
#X-axis.
g3 <- g3 + scale_x_continuous(name = "GSS Survey Year",
                              breaks = years,
                              labels = years)
#Y-axis.
g3 <- g3 + scale_y_continuous(name = "Number of Surveys Gathered")
#Modify labels and text.
g3 <- g3 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
                 axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                 axis.title.x = element_blank(),
                 axis.text.y = element_text(size = 12),
                 axis.title.y = element_text(size = 14, face = "bold"),
                 legend.title = element_text(face = "bold"),
                 legend.key = element_blank(),
                 legend.position = "bottom")
g3

Plot : Average Number of Children by Degree and Year (Female Surveys Only)

#Calculate the mean number of children by year and degree.
year_deg_child_mean <- aggregate(year_deg_child$childs,
                                 list(year_deg_child$year, year_deg_child$degree),
                                 mean, na.rm = TRUE) %>%
    `colnames<-`(c("Year", "Degree", "Children"))


#Scatter and Line Plot.
g4 <- ggplot(year_deg_child_mean, aes(x = Year, y = Children, color = Degree))
g4 <- g4 + geom_point(size = 6) + geom_line(aes(group = Degree), size = 1.3)
#Title.
g4 <- g4 + ggtitle("Average Number of Children by Degree and Year (Female Surveys Only)")
#Viridis Color Scheme.
g4 <- g4 + scale_color_viridis_d()
#X-axis.
g4 <- g4 + scale_x_continuous(name = "GSS Survey Year",
                              breaks = years,
                              labels = years)
#Y-axis.
g4 <- g4 + scale_y_continuous(name = "Number of Children")
#Modify labels and text.
g4 <- g4 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
                 axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                 axis.title.x = element_blank(),
                 axis.text.y = element_text(size = 12),
                 axis.title.y = element_text(size = 14, face = "bold"),
                 legend.title = element_text(face = "bold"),
                 legend.key = element_blank(),
                 legend.position = "bottom")
g4

Plot: Spread of Children by Education Level, 1972 - 2012 (Female Surveys Only)

#Spread of children by degree.
deg_child <- year_deg_child %>%
  select(degree, childs) %>%
  `colnames<-`(c("Degree", "Children"))


#BoxPlot.
g5 <- ggplot(deg_child, aes(x = Degree, y = Children, color = Degree))
g5 <- g5 + geom_boxplot(size = 1.3)
#Title.
g5 <- g5 + ggtitle("Spread of Children by Education Level, 1972 - 2012 (Female Surveys Only)")
#Viridis Color Scheme.
g5 <- g5 + scale_color_viridis_d()
#X-axis.
g5 <- g5 + scale_x_discrete(name = "Education Level")
#Y-axis.
g5 <- g5 + scale_y_continuous(name = "Number of Children")
#Modify labels and text.
g5 <- g5 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
                 axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                 axis.title.x = element_blank(),
                 axis.text.y = element_text(size = 12),
                 axis.title.y = element_text(size = 14, face = "bold"),
                 legend.title = element_text(face = "bold"),
                 legend.position = "none")
g5

Plot : Spread of Children by Education Level, 1972 - 2012 (Female Surveys Only)

#Distribution of children by degree.
g6 <- ggplot(deg_child, aes(x = Children, fill = Degree))
g6 <- g6 + geom_bar(stat = "count")
g6 <- g6 + facet_wrap(~Degree, scales = "free")
#Title.
g6 <- g6 + ggtitle("Spread of Children by Education Level, 1972 - 2012 (Female Surveys Only)")
#Viridis Color Scheme.
g6 <- g6 + scale_fill_viridis_d()
#X-axis
g6 <- g6 + scale_x_continuous("Number of Children")
#Y-axis.
g6 <- g6 + scale_y_continuous(name = "Frequency")
#Modify labels and text.
g6 <- g6 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
                 axis.text.x = element_text(size = 12),
                 axis.title.x = element_text(size = 14, face = "bold"),
                 axis.text.y = element_text(size = 12),
                 axis.title.y = element_text(size = 14, face = "bold"),
                 legend.title = element_text(face = "bold"),
                 legend.position = "none")
g6

Summary Statistics.

#For each degree, determine the number of surveys completed, as well as the mean, variance,
#skewness, and kurtosis of the number of children.
deg_child_stats <- do.call(data.frame,
                      aggregate(year_deg_child$childs,
                                list(year_deg_child$degree),
                                FUN = function(x) c(n = length(x),
                                                    m = mean(x),
                                                    var = var(x),
                                                    sk = skewness(x),
                                                    kt = kurtosis(x)))) %>%
  `colnames<-`(c("Degree", "Count", "Mean", "Variance", "Skewness", "Kurtosis"))

#Bind the same stats for all the data to that which is grouped by degree.
deg_child_stats <- rbind(deg_child_stats,
                         data.frame("Degree" = "Overall",
                                    "Count" = nrow(year_deg_child),
                                    "Mean" = mean(year_deg_child$childs),
                                    "Variance" = var(year_deg_child$childs),
                                    "Skewness" = skewness(year_deg_child$childs),
                                    "Kurtosis" = kurtosis(year_deg_child$childs)))
#Summary of calculations of children by degree.                                    
deg_child_stats

Part 4: Inference

ANOVA :

#ANOVA on children by degree.  5 groups, 10 pairs.
anova1 <- aov(childs ~ degree, data = year_deg_child)
#ANOVA summary.
summary(anova1)

Statistics : 95% Family-Wise Confidence Level.

#Pairwise comparisons.
degree_comparisons <- data.frame((TukeyHSD(anova1))[1]) %>%
  `colnames<-`(c("Center", "Lower_Limit", "Upper_Limit", "Adjusted_p_Value"))

degree_comparisons$Degrees <- rownames(degree_comparisons)
rownames(degree_comparisons) <- NULL

#Round values.
degree_comparisons<- degree_comparisons %>%
  mutate(Degrees = as.factor(Degrees),
         Center = round(Center, 4),
         Lower_Limit = round(Lower_Limit, 4),
         Upper_Limit = round(Upper_Limit, 4),
         Adjusted_p_Value = round(Adjusted_p_Value, 8),
         Significance = ifelse(Adjusted_p_Value > .9, "Fail to Reject H0", "Reject H0")) %>%
  select(5, 1, 2, 3, 4, 6)

#Define factor level to preserve order.
degree_comparisons$Degrees <- factor(degree_comparisons$Degrees,
                                     levels = degree_comparisons$Degrees[order(-as.integer(rownames(degree_comparisons)))])

#Summary of 10 pairs.
print(degree_comparisons, width = 100)

Plot: 95% Family-Wise Confidence Level.

#Confidence Level Plot.
g7 <- ggplot(degree_comparisons, aes(y = Degrees))
#Vertical line at x = 0, no difference in means.
g7 <- g7 + geom_vline(xintercept = 0,
                      linetype = "dashed",
                      color = viridis(6)[1],
                      size = 1)
#Lower Limit, Center, and Upper Limit for each pair.
g7 <- g7 + geom_point(aes(x = Lower_Limit, color = Significance), size = 5, pch = "|")
g7 <- g7 + geom_point(aes(x = Center, color = Significance), size = 3, pch = "|")
g7 <- g7 + geom_point(aes(x = Upper_Limit, color = Significance), size = 5, pch = "|")
#Connect Lower Limit and Upper Limit for each pair.
g7 <- g7 + geom_segment(aes(x = Lower_Limit,
                            y = Degrees,
                            xend = Upper_Limit,
                            yend = Degrees,
                            color = Significance),
                        size = 1.25)
#Title.
g7 <- g7 + ggtitle("95% Family-Wise Confidence Level")
#X-axis.
g7 <- g7 + xlab("Difference in Mean Number of Children by Degree (Female Surveys Only)")
#Y-axis.
g7 <- g7 + ylab("Degree Comparisons")
#Viridis Color Scale.
g7 <- g7 + scale_color_manual(name = "Significance",
                              labels = c("Fail to Reject H0", "Reject H0"),
                              values = c(viridis(5)[3], viridis(5)[4]))

g7 <- g7 + guides(color = guide_legend(reverse=TRUE))

#Viridis for Y-axis ticks.
fontcolor <- rev(ifelse(degree_comparisons$Significance == "Fail to Reject H0", viridis(5)[3], viridis(5)[4]))

#Modify labels and text.
g7 <- g7 + theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"),
                 axis.text.x = element_text(size = 11),
                 axis.text.y = element_text(size = 11, angle = 25, color = fontcolor),
                 axis.title.x = element_text(size = 14, face = "bold"),
                 axis.title.y = element_text(size = 14, face = "bold"),
                 legend.title = element_text(face = "bold"),
                 legend.key = element_blank(),
                 legend.position = "bottom")
g7

Statistical Inference with the GSS Data

Rohan Lewis

5/24/2020

Part 1: Data

Part 2: Research Question

Part 3: Exploratory Data Analysis

Part 4: Inference

Conditions for ANOVA:

Hypotheses:

Conclusion:

Appendix

Setup

Load Packages.

Load Data.

Part 1: Data

Part 3: Exploratory Data Analysis

Part 4: Inference