The purpose of this project is to analyze the data set from the National Longitudinal Survey of Youth (1997 cohort) in order to determine if there is a significant difference in income between men and women. We will also determine whether any discrepancy in income varies with other factors in the data (e.g., education, profession, criminal history, marital status, etc.).
The insights that I gained show that gender certainly has an impact on income. I am confident that this analysis shows some significant directional relationships between gender and income, but I am not very confident in the model's predictive power, and even less confident about drawing causal conclusions from the analysis. This study would need more data and controlled testing to distinguish causation from correlation.
The first step is to import and put our data into a useful format. Below we read in the data and rename the columns with meaningful variable names:
# load required packages (used throughout this analysis)
library(plyr)     # mapvalues(), ddply()
library(ggplot2)  # qplot(), ggplot()
library(knitr)    # kable()
# import raw csv data from working directory
nlsy <- read.csv(file.path(getwd(), "nlsy97_income.csv"), header = TRUE)
# change column names to more descriptive names
colnames(nlsy) <- c("incar.num","incar.age.first","incar.months.longest","pubid","agree.teacher.good","agree.feel.safe","days.religious","perc.chance.flu","perc.chance.hs.diploma.by.20","perc.chance.jail.by.20","perc.chance.parent.by.20","perc.chance.college.by.30","gender","birth.month","birth.year","limitations.school.work","depressed.female","depressed.male","enrollment.status","net.worth.from.parent.1997","sample","race","grades.8th","grades.hs","drug.use.1998","days.marijuana.used.last.30.2000","disorganized","trustful","net.worth.from.r.2003","weight.2004","weight.description.2004","resp.help.less.fortunate","resp.take.care.self","helping.people.important","resp.look.after.self","college.public.private","income.family","household.size","household.mem.under.18","household.mem.under.6","highest.degree.asof.2011","marital.status","highest.grade.asof.2011","income.yes.no","income","income.spouse.yes.no","income.spouse","income.spouse.est","height.ft","height.inch","weight.2011","smoked.since.last.interview","alcohol.since.last.interview","days.marijuana.used.last.30.2011","drug.use.since.last.interview","weight.description.2011","weight.goal.2011","energy.past.4.wk","work.industry","score.sat.math","score.sat.verbal","score.act","debt.age.20","num.employee.jobs.14.to.20","num.employee.jobs.since.20","num.jobs.since.20","debt.age.30")
Initial Observation Count is: 8984
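(The observation counts quoted throughout are presumably computed inline in the RMD with a one-liner like the following:)
# current observation (row) count
nrow(nlsy)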
The data uses five negative integer codes (-5 to -1) that are reserved for survey questions without answers. We want to remove these values and replace them with NA so that they do not interfere with regression. However, we don't want to replace all negative values with NA, since some fields have legitimate negative values (e.g., net worth can be negative). Therefore, we will map -1 through -5 to NA for numeric columns and to "?" for categorical columns:
# NOTE: var.numeric and var.factor (vectors of numeric / categorical column
# names) are defined in the suppressed RMD setup code
# Map special values to NA in every NUMERIC column: Refusal(-1), Don't Know(-2), Invalid Skip(-3), Valid Skip(-4), Non-Interview(-5)
nlsy[var.numeric] <- lapply(nlsy[var.numeric], function(x) replace(x, x %in% c(-1,-2,-3,-4,-5), NA))
# Map special values to "?" in every FACTOR column (same reserved codes)
nlsy[var.factor] <- lapply(nlsy[var.factor], function(x) replace(x, x %in% c(-1,-2,-3,-4,-5), "?"))
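As a sanity check of my own (not in the original RMD), we can confirm that no reserved codes remain in the numeric columns:
# verify no reserved codes (-5..-1) survive in any numeric column
stopifnot(!any(sapply(nlsy[var.numeric], function(x) any(x %in% -5:-1))))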
Next, map all categorical variables to factors with meaningful level names. Below I show code for only two variables (gender and race) as an example; in the RMD you will see I suppressed code that maps over 30 factor variables, all with custom level names.
# assign meaningful factor names
nlsy <- transform(nlsy,
gender = factor(mapvalues(gender,
c(1, 2),
c("Male", "Female")),
levels= c("Male", "Female")
),
race = factor(mapvalues(race,
c(1, 2, 3, 4),
c("Black", "Hispanic", "Mixed", "Non-Black/NonHispanic")),
levels= c("Non-Black/NonHispanic", "Black", "Hispanic", "Mixed", "?")
)
)
After analyzing the data, it becomes apparent that several factor variables have too many levels with very small counts, so it makes sense to re-level them into broader categories:
# add my own simplified level factor variables
nlsy <- transform(nlsy, ever.married = factor(mapvalues(nlsy[["marital.status"]],
c("Never-married", "Married", "Separated", "Divorced",
"Widowed", "?"),
c("Never Married", "Married", "Married", "Married", "Married",
"Never Married")),
levels= c("Never Married", "Married")
),
work.industry.food = factor(mapvalues(work.industry,
c("Prof Services", "Construction", "Edu/Health",
"Rtl Trade", "Manufacturing", "Food",
"Finance, Ins, RE", "?", "Public Admin", "Other Service",
"Transportation", "Whlsle Trade", "Info & Comm", "Mining",
"Utilities", "Agriculture", "Military", "Special Code"),
c(rep("No",5), "Yes", rep("No",12)))
),
degree.post.hs = factor(mapvalues(highest.degree.asof.2011,
c("None", "GED", "High School", "Associate",
"Bachelor", "Master", "PhD", "Professional","?"),
c("No", "No", "No", "Yes", "Yes", "Yes", "Yes",
"Yes","No")),
levels = c("No","Yes")
),
race.black = factor(mapvalues(race,
c("Black", "Hispanic", "Mixed", "Non-Black/NonHispanic"),
c("Black", "NonBlack", "NonBlack", "NonBlack")),
levels= c("NonBlack", "Black")
)
)
Now we need to address the "topcoding" issue. For the income variable, we know the top 2% of values were replaced with the average value of that top 2%, which is easily seen in a histogram.
# plot histogram of income
qplot(income, geom = "histogram", binwidth=5000, data = nlsy)
There are 120 top-coded values with the exact income amount of $146,002; we will exclude those outliers from the regression data. We might have kept them if their actual values were known, but top-coded values can interfere with our regression analysis, so we need to remove them.
This removal does narrow the scope of our analysis to exclude very high earners. Given the data constraints, this is an acceptable reduction in scope.
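Before removing them, a quick check of my own confirms how many rows sit at the top-coded maximum:
# count rows at the top-coded maximum income (should be 120)
sum(nlsy$income == max(nlsy$income, na.rm = TRUE), na.rm = TRUE)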
# remove top-coded incomes
nlsy <- nlsy[-which(nlsy$income == max(nlsy$income, na.rm = TRUE)), ]
qplot(x=income, geom = "histogram", binwidth=5000, data = nlsy)
Above we can see we successfully removed the top-coded values.
New Observation Count is: 8864
We also want to clean up some odd values and extreme outliers in other columns:
# make top-coded family incomes NA
nlsy[which(nlsy$income.family==max(nlsy$income.family, na.rm = TRUE)), "income.family" ] <- NA
nlsy[which(nlsy$income.spouse==max(nlsy$income.spouse, na.rm = TRUE)), "income.spouse" ] <- NA
# make highest grade NA for value 95
nlsy[which.max(nlsy$highest.grade.asof.2011), "highest.grade.asof.2011" ] <- NA
# make extremely low outlier for net.worth = NA
nlsy[which.min(nlsy$net.worth.from.parent.1997), "net.worth.from.parent.1997" ] <- NA
# make top-coded parent net worth values NA
nlsy[which.max(nlsy$net.worth.from.parent.1997), "net.worth.from.parent.1997" ] <- NA
# data within income.yes.no is slightly inconsistent with NAs in income. Fix it.
nlsy[which(is.na(nlsy$income) | nlsy$income == 0), "income.yes.no"] <- "No"
nlsy[which(nlsy$income > 0), "income.yes.no"] <- "Yes"
If we take a quick look at the data, we see that a large portion of the rows have either zero or missing income:
kable(ddply(nlsy, ~ income.yes.no, summarize, count=length(income), mean.income=mean(income, na.rm = TRUE)), format = "markdown")
income.yes.no | count | mean.income |
---|---|---|
Yes | 5182 | 31573.81 |
No | 3682 | NaN |
Since the purpose of this project is to analyze differences in income between men and women, we can remove those rows without income from our dataset:
nlsy <- nlsy[which(nlsy$income.yes.no=="Yes") , ]
New Observation Count is: 5182.
Now that we’ve cleaned our data, let’s take a look at some basic counts, means, and boxplots.
The income gap we are trying to analyze is most easily shown like this:
counts.means <- ddply(nlsy, ~ gender, summarize, count=length(income), mean.income=mean(income, na.rm = TRUE))
kable(counts.means, format = "markdown")
gender | count | mean.income |
---|---|---|
Male | 2701 | 34309.89 |
Female | 2481 | 28595.11 |
Our goal is to analyze this income gap of $5714.78.
Visually, a boxplot can show us an even better summary of the data, since it shows the distribution. The box represents the interquartile range (IQR, the 25th to 75th percentile), the horizontal line is the median, the whiskers extend to the most extreme points within 1.5 times the IQR of the box, and the dots are outliers beyond that.
qplot(gender, income, geom = "boxplot", data = nlsy, na.rm=TRUE, fill=gender)+ scale_fill_manual(values=c("#79a6c8", "#f67f7f"))
The above boxplot gives us similar information about the gender gap, but also shows the distribution. This type of visualization is used heavily in the Methodology section.
One common variable used in this type of analysis is race, so let’s take a look at a race breakout:
qplot(race, income, geom = "boxplot", data = nlsy, na.rm=TRUE, fill=gender)+ scale_fill_manual(values=c("#79a6c8", "#f67f7f"))
From the above plot, it looks like women of mixed race may be penalized the most, while Black women are penalized the least for being a woman. Although the gender income gap is smallest for the Black bucket, that bucket also has the lowest income overall.
Before we begin presenting the data in more complex ways, let’s first explain the methodology used.
The code below shows how I first tackled the data after the initial cleanup. For each categorical variable, I generated a bar chart of counts by gender and a boxplot of income by gender, each faceted by that variable; for each numeric variable, I generated a scatterplot of income against the variable, colored by gender, with linear fits.
This resulted in over 60 plots that I could visually analyze to understand the data. Refer to the RMD code to see how I plotted everything at once; a minimal sketch follows below.
I will not show the output of all 60+ plots since it is quite extensive, but I will show select examples of my takeaways.
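A minimal sketch of that looping approach, assuming var.factor holds the names of all categorical columns (the full RMD produces the bar charts and scatterplots the same way):
# sketch: produce one faceted income boxplot per categorical variable
lapply(var.factor, FUN = function(iterate)
  ggplot(data = nlsy, aes_string(x = "gender", y = "income", fill = "gender")) +
    geom_boxplot(na.rm = TRUE) + facet_wrap(as.formula(paste("~", iterate))) +
    scale_fill_manual(values = c("#79a6c8", "#f67f7f")) +
    ggtitle(iterate)
)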
When discussing the gender income gap, many people think the difference is wholly attributable to men tending to work in higher-paying industries. Before we analyze that claim, let's see the distribution of gender across industries:
# Categorical variables: bar chart counts
lapply("work.industry", FUN = function(iterate)
ggplot(data=nlsy, aes_string(x="gender", fill="gender")) +
geom_bar(na.rm = TRUE, stat = "count", position = "dodge") + facet_wrap(as.formula(paste("~", iterate))) +
scale_fill_manual(values=c("#79a6c8", "#f67f7f")) +
ggtitle(paste(iterate," (NA count = ", sum(is.na(nlsy[[iterate]])), ")", sep = ""))
)
Work industry does seem to be split unevenly across genders.
The stark differences seen above may or may not be useful in our model. We also may need to summarize this variable into broader buckets during modelling.
An example of an interesting categorical variable was marital status. The variable shown below is the simplified version of the marital status variable, which indicates whether the respondent was never married vs. married/widowed/divorced/etc. (Note: the numbers shown in the boxes are the count of data points, placed at the mean.)
# this function returns the count of each boxplot, and places the count at the mean on the plot
# I found this function on stackoverflow
# citation: http://stackoverflow.com/questions/3483203/create-a-boxplot-in-r-that-labels-a-box-with-the-sample-size-n
count.n <- function(x){
return(c(y = mean(x), label = length(x)))
}
# Categorical variables: boxplots
lapply("ever.married", FUN = function(iterate)
ggplot(data=nlsy, aes_string(x="gender", y="income", fill="gender")) +
geom_boxplot(na.rm=TRUE) + facet_wrap(as.formula(paste("~", iterate))) +
scale_fill_manual(values=c("#79a6c8", "#f67f7f")) +
ggtitle(paste(iterate," (NA count = ", sum(is.na(nlsy[[iterate]])), ")", sep = "")) +
stat_summary(fun.data = count.n, geom = "text")
)
Income seemed to change significantly based on the interaction of gender and marital status.
An example of a numeric variable that caught my eye was household size, particularly the number of household members under the age of 18.
# Numeric variables: dot plot with linear regression
lapply("household.mem.under.18", FUN = function(iterate) ggplot(data=nlsy, aes_string(x=iterate, y="income", colour="gender")) +
scale_colour_manual(values=c("#79a6c8", "#f67f7f")) +
geom_point(na.rm=TRUE) +
geom_smooth(method='lm',formula=y~x) +
ggtitle(paste(iterate," (NA count = ", sum(is.na(nlsy[[iterate]])), ")", sep = ""))
)
Females showed drastically lower incomes, and males slightly higher incomes, as the number of household members under the age of 18 increased.
The following graphs show household members under 18 against income, cut by gender (color) and split by "Never Married" vs. "Married". Interestingly, it looks like males are virtually unaffected by increased household size if they are married. Females that never married seemed to fare worse than married females.
# Numeric variables: dot plot with linear regression
lapply("household.mem.under.18", FUN = function(iterate) ggplot(data=nlsy, aes_string(x=iterate, y="income", colour="gender")) +
scale_colour_manual(values=c("#79a6c8", "#f67f7f")) +
geom_point(na.rm=TRUE) +
geom_smooth(method='lm',formula=y~x) +
facet_wrap(~ever.married) +
ggtitle(paste(iterate," (NA count = ", sum(is.na(nlsy[[iterate]])), ")", sep = ""))
)
I expected the highest.degree.asof.2011 variable to be very telling, but surprisingly it didn't seem too useful in explaining the gender gap. I suspected it had too many levels, so I created another variable with fewer levels called degree.post.hs, with Yes or No values. Below you can see that the difference in the gender gap is more apparent for those respondents who did not receive a degree post high school:
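(The plot code was suppressed in the original RMD; here is a sketch following the same boxplot pattern used for ever.married above:)
# sketch: gender income boxplots faceted by degree.post.hs
ggplot(data = nlsy, aes(x = gender, y = income, fill = gender)) +
  geom_boxplot(na.rm = TRUE) + facet_wrap(~ degree.post.hs) +
  scale_fill_manual(values = c("#79a6c8", "#f67f7f")) +
  stat_summary(fun.data = count.n, geom = "text")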
During regression modelling, I also found that the race variable was creating problems when combined with other variables. I felt that all the variables I had were useful, but there was not enough data to satisfy every race break for every other factor. Therefore, I releveled race into its two most significant buckets, "Black" and "NonBlack", in a variable called race.black:
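(Again, the plot code was suppressed; the same sketch applies with race.black swapped in:)
# sketch: gender income boxplots faceted by race.black
ggplot(data = nlsy, aes(x = gender, y = income, fill = gender)) +
  geom_boxplot(na.rm = TRUE) + facet_wrap(~ race.black) +
  scale_fill_manual(values = c("#79a6c8", "#f67f7f"))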
The gender gap difference here is much more apparent with large samples in each bucket, and therefore more useful for modelling purposes.
In total, the variables that looked significant to me were the following:
Categorical: gender, race, race.black, work.industry.food, smoked.since.last.interview, degree.post.hs, ever.married
Numeric: household.mem.under.18, income.family, income.spouse, net.worth.from.parent.1997, net.worth.from.r.2003, incar.num
Let's add them to vectors for easy use:
var.factor.model <- c("gender", "race", "race.black", "work.industry.food", "smoked.since.last.interview", "degree.post.hs","ever.married")
var.numeric.model <- c("household.mem.under.18", "income.family", "income.spouse", "net.worth.from.parent.1997", "net.worth.from.r.2003", "incar.num")
In the next section, we will use statistical analysis to see if we can make any findings.
The first thing I want to do is check for collinearity among our variables.
# Function taken from ?pairs Example section.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use='pairwise')) # enabled pairwise option
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = pmax(1, cex.cor * r))
}
# Use panel.cor to display correlations in lower panel.
pairs(nlsy[ , var.numeric.model], lower.panel = panel.cor)
Spousal income and family income are highly correlated. Based on prior research I am choosing to remove spousal income.
# remove element
var.numeric.model <- var.numeric.model[! var.numeric.model %in% c("income.spouse")]
Next, I want to run a data summary broken out by gender to validate that the variables differ significantly across gender:
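The helper generateDataSummary() is defined in the suppressed RMD code. A minimal sketch of what it plausibly computes (an assumption of mine, based on the table below): the mean (SD) of each variable per gender, plus a two-sample t-test P-value.
# sketch of generateDataSummary: mean (SD) per group plus t-test P-value
generateDataSummary <- function(df, vars, group) {
  do.call(rbind, lapply(vars, function(v) {
    grp <- split(df[[v]], df[[group]])
    fmt <- function(x) sprintf("%.2f (%.2f)", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
    data.frame(Variable = v,
               Missing  = sum(is.na(df[[v]])),
               Male     = fmt(grp$Male),
               Female   = fmt(grp$Female),
               P.Value  = round(t.test(grp$Male, grp$Female)$p.value, 4))
  }))
}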
# generate summary of numeric variables
kable(generateDataSummary(nlsy, var.numeric.model, "gender"), format="markdown")
Variable | Missing | Male Mean (SD) | Female Mean (SD) | P-Value |
---|---|---|---|---|
household.mem.under.18 | 12 | 0.83 (1.16) | 1.2 (1.29) | 0 |
income.family | 568 | 61236.25 (37158.02) | 63903.92 (39937.76) | 0.0192 |
net.worth.from.parent.1997 | 1284 | 95762.03 (137133.39) | 96533.08 (140889.49) | 0.8627 |
net.worth.from.r.2003 | 3916 | 10409.39 (27254.76) | 12105.33 (41010.21) | 0.3948 |
incar.num | 0 | 0.18 (0.65) | 0.04 (0.29) | 0 |
Unfortunately, it looks like I need to throw out both net worth variables from my analysis. Their P-values are greater than our threshold of 0.05, so we should not use them in the regression analysis.
I also want to remove income.family since it has too many missing values. Even if income.family had enough data, I probably would have needed to remove it anyway. Originally, I thought the variable indicated family income during childhood, but since it comes from 2011 data it is more likely simply the sum of the respondent's income and other household members' incomes, which is not very useful; it is probably too closely related to the dependent variable.
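A quick way to check that suspicion (my own addition, not in the original) is a pairwise correlation between the two income measures:
# how closely does income.family track the respondent's own income?
cor(nlsy$income, nlsy$income.family, use = "pairwise.complete.obs")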
Let’s remove them:
# remove element
var.numeric.model <- var.numeric.model[! var.numeric.model %in% c("net.worth.from.r.2003", "net.worth.from.parent.1997", "income.family", "income.spouse")]
Now it's time to start our regression. First I want to cut the dataset down for modelling so I can omit any row with NA:
# cut down to relevant columns
nlsy.trim <- nlsy[ , c("income",var.numeric.model, var.factor.model)]
# cut down to non-NA rows
nlsy.trim <- na.omit(nlsy.trim)
Final regression model observation count excluding all NAs: 5170
Let's create an initial linear model using just the gender variable:
all.lm00 <- lm(income ~ gender, data = nlsy.trim)
kable(round(summary(all.lm00)$coef, 3), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 34326.285 | 372.019 | 92.270 | 0 |
genderFemale | -5702.005 | 537.571 | -10.607 | 0 |
Looks good so far. We have a P-value of 0 for gender and a coefficient of -5702.005, which represents the income gap ignoring all other possible factors.
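Since gender is the only predictor, the genderFemale coefficient is exactly the difference in group means, which we can confirm directly; it differs slightly from the $5714.78 gap computed earlier because nlsy.trim dropped rows containing NAs:
# confirm the coefficient equals the raw difference in group means
with(nlsy.trim, mean(income[gender == "Female"]) - mean(income[gender == "Male"]))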
Now let’s add more variables to see if we can explain the income gap.
# add race
all.lm01 <- update(all.lm00, . ~ . + race)
anova(all.lm00, all.lm01)
## Analysis of Variance Table
##
## Model 1: income ~ gender
## Model 2: income ~ gender + race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5168 1.9269e+12
## 2 5165 1.8837e+12 3 4.3185e+10 39.471 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Good: the ANOVA test shows that adding race significantly improves the model. I used ANOVA throughout my model iterations, but I will not show every output.
Let’s continue adding variables:
# add marriage
all.lm02 <- update(all.lm01, . ~ . + ever.married)
#anova(all.lm01, all.lm02)
kable(round(summary(all.lm02)$coef, 3), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 33884.110 | 509.573 | 66.495 | 0.000 |
genderFemale | -5843.250 | 528.158 | -11.063 | 0.000 |
raceBlack | -5741.965 | 668.197 | -8.593 | 0.000 |
raceHispanic | -3322.324 | 665.224 | -4.994 | 0.000 |
raceMixed | 1271.393 | 2725.161 | 0.467 | 0.641 |
ever.marriedMarried | 5466.192 | 537.475 | 10.170 | 0.000 |
Unfortunately, we have an insignificant variable in raceMixed, with a P-value of 0.641. This is probably due to the small sample size in that bucket. Let's swap race out for the simpler variable race.black:
#remove race and add race.black
all.lm03 <- update(all.lm02, . ~ . - race + race.black)
kable(round(anova(all.lm02, all.lm03), 3))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
5164 | 1.846691e+12 | NA | NA | NA | NA |
5166 | 1.855852e+12 | -2 | -9161153812 | 12.809 | 0 |
kable(round(summary(all.lm03)$coef, 3), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 32901.720 | 467.186 | 70.425 | 0 |
genderFemale | -5850.355 | 529.349 | -11.052 | 0 |
ever.marriedMarried | 5571.017 | 538.272 | 10.350 | 0 |
race.blackBlack | -4787.516 | 639.176 | -7.490 | 0 |
Looks great! Let’s keep iterating that process:
# add degree.post.hs
all.lm04 <- update(all.lm03, . ~ . + degree.post.hs)
#summary(all.lm04)
# add work.industry.food
all.lm05 <- update(all.lm04, . ~ . + work.industry.food)
#summary(all.lm05)
# add smoked.since.last.interview
all.lm06 <- update(all.lm05, . ~ . + smoked.since.last.interview)
kable(round(summary(all.lm06)$coef, 3), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 28422.942 | 548.577 | 51.812 | 0.000 |
genderFemale | -7507.578 | 503.518 | -14.910 | 0.000 |
ever.marriedMarried | 4776.841 | 508.933 | 9.386 | 0.000 |
race.blackBlack | -3773.635 | 606.201 | -6.225 | 0.000 |
degree.post.hsYes | 11346.478 | 527.057 | 21.528 | 0.000 |
work.industry.foodYes | -8086.750 | 805.678 | -10.037 | 0.000 |
smoked.since.last.interviewNo | 3064.295 | 533.041 | 5.749 | 0.000 |
smoked.since.last.interview? | -3914.918 | 3171.527 | -1.234 | 0.217 |
The variable smoked.since.last.interview has a missing-data level (indicated by "?") that is not significant, with a P-value of 0.217. Let's remove the variable and see if household.mem.under.18 and incar.num are useful:
# add household.mem.under.18 and remove smoked.since.last.interview
all.lm07 <- update(all.lm06, . ~ . + household.mem.under.18 - smoked.since.last.interview)
#kable(round(summary(all.lm07)$coef, 3), format = "markdown")
#turned out to be good
# add incar.num
all.lm07 <- update(all.lm07, . ~ . + incar.num)
kable(round(summary(all.lm07)$coef, 3), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 31367.925 | 515.848 | 60.808 | 0 |
genderFemale | -7310.443 | 512.138 | -14.274 | 0 |
ever.marriedMarried | 5775.241 | 532.558 | 10.844 | 0 |
race.blackBlack | -3145.910 | 604.181 | -5.207 | 0 |
degree.post.hsYes | 10723.261 | 542.107 | 19.781 | 0 |
work.industry.foodYes | -8080.518 | 802.341 | -10.071 | 0 |
household.mem.under.18 | -1073.582 | 221.236 | -4.853 | 0 |
incar.num | -3517.142 | 490.297 | -7.173 | 0 |
Now let's add some interaction variables with gender to see if we can find cases where income changes across these variables based on gender.
# add interaction var for race.black and gender
all.lm08 <- update(all.lm07, . ~ . + race.black*gender)
#summary(all.lm08)
# add interaction var for marriage and gender
all.lm09 <- update(all.lm08, . ~ . + ever.married*gender)
#summary(all.lm09)
# add interaction var for household.mem.under.18 and gender
all.lm10 <- update(all.lm09, . ~ . + household.mem.under.18*gender)
#summary(all.lm10)
# add interaction var for education and gender
all.lm11 <- update(all.lm10, . ~ . + degree.post.hs*gender)
kable(round(summary(all.lm11)$coef, 3), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 29878.862 | 607.697 | 49.167 | 0.000 |
genderFemale | -3434.553 | 957.322 | -3.588 | 0.000 |
ever.marriedMarried | 8194.194 | 741.693 | 11.048 | 0.000 |
race.blackBlack | -5165.663 | 838.081 | -6.164 | 0.000 |
degree.post.hsYes | 9318.626 | 754.018 | 12.359 | 0.000 |
work.industry.foodYes | -7585.502 | 794.284 | -9.550 | 0.000 |
household.mem.under.18 | 465.335 | 322.035 | 1.445 | 0.149 |
incar.num | -3659.436 | 485.759 | -7.533 | 0.000 |
genderFemale:race.blackBlack | 4035.707 | 1199.203 | 3.365 | 0.001 |
genderFemale:ever.marriedMarried | -5346.092 | 1054.818 | -5.068 | 0.000 |
genderFemale:household.mem.under.18 | -3055.363 | 441.831 | -6.915 | 0.000 |
genderFemale:degree.post.hsYes | 2120.623 | 1065.489 | 1.990 | 0.047 |
All of the interactions added turned out to be significant, with the degree.post.hs-by-gender interaction having the highest P-value, 0.047, which is still acceptable for our alpha of 0.05.
The P-value of 0.149 for the base household.mem.under.18 term is acceptable only because that variable's interaction with gender is significant, with a P-value near 0; we must keep the lower-order base term whenever its interaction term is present. In any case, the coefficient on the non-interaction household.mem.under.18 term is small, at only $465.
We will use this as our final model.
# pick best model
final.lm <- all.lm11
Now let's check the model using the standard diagnostic plots:
# Plot
plot(final.lm)
The diagnostics suggest this probably is not an appropriate predictive model for income.
However, our objective is not necessarily to create a perfect model to predict income. We are really interested in explaining the gender income gap, which is best shown by our interaction variables with gender. We'll discuss this in the next section.
Now let’s display the summary of the model to take a look at the R-squared and the coefficients:
# get model summary
final.lm.summary <- summary(final.lm)
# print summary
kable(round(final.lm.summary$coef, 3), format = "markdown")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 29878.862 | 607.697 | 49.167 | 0.000 |
genderFemale | -3434.553 | 957.322 | -3.588 | 0.000 |
ever.marriedMarried | 8194.194 | 741.693 | 11.048 | 0.000 |
race.blackBlack | -5165.663 | 838.081 | -6.164 | 0.000 |
degree.post.hsYes | 9318.626 | 754.018 | 12.359 | 0.000 |
work.industry.foodYes | -7585.502 | 794.284 | -9.550 | 0.000 |
household.mem.under.18 | 465.335 | 322.035 | 1.445 | 0.149 |
incar.num | -3659.436 | 485.759 | -7.533 | 0.000 |
genderFemale:race.blackBlack | 4035.707 | 1199.203 | 3.365 | 0.001 |
genderFemale:ever.marriedMarried | -5346.092 | 1054.818 | -5.068 | 0.000 |
genderFemale:household.mem.under.18 | -3055.363 | 441.831 | -6.915 | 0.000 |
genderFemale:degree.post.hsYes | 2120.623 | 1065.489 | 1.990 | 0.047 |
This model has many significant variables (P-values below 0.05), but also a very low R-squared of 19.95%, indicating the model explains only 19.95% of the variance in income.
What's very interesting is the high significance of the gender interaction variables and their respective coefficients. For example, although females make $3435 less than males in the baseline case, they lose an additional $5346 relative to males if they are married!
Black females are not penalized for gender the way other females are: the gender gap shrinks by $4036 among Black respondents, nearly offsetting the baseline female penalty, though Black respondents overall make $5166 less than non-Black respondents.
Also, each additional household member under 18 is associated with a female making $3055 less, relative to the male trend. And females with a post-high-school degree recover $2121 of the gender gap compared to males with the same degree.
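To make these interactions concrete, we can compare predictions for two otherwise-identical illustrative profiles (the profile values are my own arbitrary choices, not from the original analysis):
# illustrative profiles: married, NonBlack, post-HS degree, non-food industry,
# two household members under 18, never incarcerated; differ only by gender
profiles <- data.frame(gender = c("Male", "Female"),
                       ever.married = "Married",
                       race.black = "NonBlack",
                       degree.post.hs = "Yes",
                       work.industry.food = "No",
                       household.mem.under.18 = 2,
                       incar.num = 0)
predict(final.lm, newdata = profiles)
# the difference between the two predictions is the model-implied gap
diff(predict(final.lm, newdata = profiles))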
The insights that I gained show that gender certainly has an impact on income. I am confident that this analysis shows some significant directional relationships between gender and income, but I am not very confident in the model's predictive power, and even less confident about drawing causal conclusions from the analysis. This study would need more data and controlled testing to distinguish causation from correlation.