Executive Summary

The purpose of this project is to analyze the data set from the National Longitudinal Survey of Youth (1997 cohort) in order to determine if there is a signifcant difference in income between men and women. We will also determine if any discrepancy in income varies based on other factors in the data (e.g., education, profession, criminal history, marriage status, etc.).

The insights that I gained show that gender certainly has an impact on income. I am confident that this analysis shows some significiant directional relationships between gender and income, but I am not very confident in the model's predictive power, and even less confident about drawing causal conclusions from the analysis. This study would need to have more data and controlled testing to distinguish causality vs correlation.

1. Data Summary

(a)Import Data and Rename Columns

The first step is to import and put our data into a useful format. Below we read in the data and rename the columns with meaningful variable names:

# import raw csv data from working directory

nlsy <- read.csv(paste(getwd(),"nlsy97_income.csv", sep = "/"), header=TRUE, )


# Change column names more descriptive names
colnames(nlsy) <- c("incar.num","incar.age.first","incar.months.longest","pubid","agree.teacher.good","agree.feel.safe","days.religious","perc.chance.flu","perc.chance.hs.diploma.by.20","perc.chance.jail.by.20","perc.chance.parent.by.20","perc.chance.college.by.30","gender","birth.month","birth.year","limitations.school.work","depressed.female","depressed.male","enrollment.status","net.worth.from.parent.1997","sample","race","grades.8th","grades.hs","drug.use.1998","days.marijuana.used.last.30.2000","disorganized","trustful","net.worth.from.r.2003","weight.2004","weight.description.2004","resp.help.less.fortunate","resp.take.care.self","helping.people.important","resp.look.after.self","college.public.private","income.family","household.size","household.mem.under.18","household.mem.under.6","highest.degree.asof.2011","marital.status","highest.grade.asof.2011","income.yes.no","income","income.spouse.yes.no","income.spouse","income.spouse.est","height.ft","height.inch","weight.2011","smoked.since.last.interview","alcohol.since.last.interview","days.marijuana.used.last.30.2011","drug.use.since.last.interview","weight.description.2011","weight.goal.2011","energy.past.4.wk","work.industry","score.sat.math","score.sat.verbal","score.act","debt.age.20","num.employee.jobs.14.to.20","num.employee.jobs.since.20","num.jobs.since.20","debt.age.30")

Initial Observation Count is: 8984

(b) Handle Negative Values

The data has 5 negative integer values (-5 to -1) that are reserved for survey questions without answers. We want to remove these values and replace them with NA so that they do not interfere with regression. However, we don’t want to replace all negative values with NA since there are some fields with legitimate negative values (e.g. net worth can be negative).

Therefore, we will map -1 through -5 to NA for numeric columns and “?” for categorical columns:

# Map special values to NA in every NUMERIC column: Refusal(-1), Don't Know(-2), Invalid Skip (-3),VALID SKIP(-4), NON-INTERVIEW(-5)
nlsy[var.numeric] <- lapply(nlsy[var.numeric], FUN= function(x) replace(x, x %in% c(-1,-2,-3,-4,-5), NA))

# Map special values to "?"" in every FACTOR column: Refusal(-1), Don't Know(-2), Invalid Skip (-3),VALID SKIP(-4), NON-INTERVIEW(-5)
nlsy[var.factor] <- lapply(nlsy[var.factor], FUN= function(x) replace(x, x %in% c(-1,-2,-3,-4,-5), "?"))

(c) Fix factor level names for categorical variables

Map all categorical variables to factors with meaningful level names. Below I show code for only 2 variables (gender and race) as an example, but in the RMD you will see I suppressed code that maps over 30 factor variables, all with custom level names.

# assign meaningful factor names
nlsy <- transform(nlsy,
                  gender = factor(mapvalues(gender,
                                            c(1, 2),
                                            c("Male", "Female")),
                                  levels= c("Male", "Female")
                                  ),
                  race = factor(mapvalues(race,
                                            c(1, 2, 3, 4),
                                            c("Black", "Hispanic", "Mixed", "Non-Black/NonHispanic")),
                                  levels= c("Non-Black/NonHispanic", "Black", "Hispanic", "Mixed", "?")
                                  )
                  )

(d) Re-level some factors

After analyzing the data, it becomes apparent that several factor variables have too many levels with very small counts, so it makes sense to re-level them into broader categories:

# add my own simplified level factor variables
nlsy <- transform(nlsy, ever.married = factor(mapvalues(nlsy[["marital.status"]],
                                                    c("Never-married", "Married", "Separated", "Divorced",
                                                      "Widowed", "?"),
                                                  c("Never Married", "Married", "Married", "Married", "Married",
                                                    "Never Married")),
                                               levels= c("Never Married", "Married")
                                             ),

                  work.industry.food = factor(mapvalues(work.industry,
                                            c("Prof Services", "Construction", "Edu/Health",
                                              "Rtl Trade", "Manufacturing", "Food",
                                              "Finance, Ins, RE", "?", "Public Admin", "Other Service",
                                              "Transportation", "Whlsle Trade", "Info & Comm", "Mining",
                                              "Utilities", "Agriculture", "Military", "Special Code"),
                                            c(rep("No",5), "Yes", rep("No",12)))
                                            ),

                  degree.post.hs = factor(mapvalues(highest.degree.asof.2011,
                                                              c("None", "GED", "High School", "Associate",
                                                                "Bachelor", "Master", "PhD", "Professional","?"),
                                                              c("No", "No", "No", "Yes", "Yes", "Yes", "Yes",
                                                                "Yes","No")),
                                          levels = c("No","Yes")
                                          ),

                  race.black = factor(mapvalues(race,
                                            c("Black", "Hispanic", "Mixed", "Non-Black/NonHispanic"),
                                            c("Black", "NonBlack", "NonBlack", "NonBlack")),
                                      levels= c("NonBlack", "Black")
                                     )
                  )

(e) Handle top-coded values and other cleanup

Now we need to address the “topcoding” issues. For the income variable we know that the top 2% were given the average value of thetop 2%, which is easily seen in a histogram.

# plot histogram of income
qplot(income, geom = "histogram", binwidth=5000, data = nlsy)

There are 120 top-coded values with the exact income amount of $146002; we will exclude those outliers from the regression data. We may have kept them if their actual value were known, but top-coded values can interfere with out regression analysis, so we need to remove them.

This removal does narrow the scope of our analysis to exclude very high earners. Given the data constraints, this is an acceptable reduction in scope.

# remove top-coded incomes
nlsy <- nlsy[-1*which(nlsy$income==max(nlsy$income, na.rm = TRUE)), ]
qplot(x=income, geom = "histogram", binwidth=5000, data = nlsy)

Above we can see we successfully removed the top-coded values.

New Observation Count is: 8864

We also want to cleanup some some odd values and extreme outliers in other columns:

# make top-coded family incomes NA
nlsy[which(nlsy$income.family==max(nlsy$income.family, na.rm = TRUE)), "income.family" ] <- NA

nlsy[which(nlsy$income.spouse==max(nlsy$income.spouse, na.rm = TRUE)), "income.spouse" ] <- NA

# make highest grade NA for value 95
nlsy[which.max(nlsy$highest.grade.asof.2011), "highest.grade.asof.2011" ] <- NA

# make extremely low outlier for net.worth = NA
nlsy[which.min(nlsy$net.worth.from.parent.1997), "net.worth.from.parent.1997" ] <- NA

# make top-coded family net worth incomes = NA
nlsy[which.max(nlsy$net.worth.from.parent.1997), "net.worth.from.parent.1997" ] <- NA

# data within income.yes.no is slightly inconsistent with NAs in income. Fix it.
nlsy[which(is.na(nlsy$income) | 0), "income.yes.no"] <- "No"
nlsy[which(nlsy$income>0), "income.yes.no"] <- "Yes"

(f) Remove records without income

If we take a quick look at the data, we actually see that a large portion of the data had either zero or missing income data:

kable(ddply(nlsy, ~ income.yes.no, summarize, count=length(income), mean.income=mean(income, na.rm = TRUE)), format = "markdown")
income.yes.no count mean.income
Yes 5182 31573.81
No 3682 NaN

Since the purpose of this project is to analyze differences in income between men and women, we can remove those rows without income from our dataset:

nlsy <- nlsy[which(nlsy$income.yes.no=="Yes") , ]

New Observation Count is: 5182.

(g) Basic counts/means

Now that we’ve cleaned our data, let’s take a look at some basic counts, means, and boxplots.

The income gap we are trying to analyze is easiest shown like this:

counts.means <- ddply(nlsy, ~ gender, summarize, count=length(income), mean.income=mean(income, na.rm = TRUE))
kable(counts.means, format = "markdown")
gender count mean.income
Male 2701 34309.89
Female 2481 28595.11

Our goal is to try and analyze this income gap of $5714.78.

Visually, a boxplot can show us an even better summary of the data, since it shows the distribution. The box represents the interquartile range (IQR, or 25th to 75th percentile), the horizontal line represents the median, the vertical lines (whiskers) show 1.5 times the IQR, and the dots are outliers.

qplot(gender, income, geom = "boxplot", data = nlsy, na.rm=TRUE, fill=gender)+ scale_fill_manual(values=c("#79a6c8", "#f67f7f"))

The above boxplot gives us similar information about the gender gap, but also shows the distribution. This type of visualization is key in the Methodology section.

One common variable used in this type of analysis is race, so let’s take a look at a race breakout:

qplot(race, income, geom = "boxplot", data = nlsy, na.rm=TRUE, fill=gender)+ scale_fill_manual(values=c("#79a6c8", "#f67f7f"))

From the above plot, it looks like women of mixed raced may be penalized the most, while black women are penalized the least for being a woman. Althought the gender income gap is the smallest for the black bucket, black also has the lowest income out of all the buckets.

Before we begin presenting the data in more complex ways, let’s first explain the methodology used.

2. Methodology

The code below shows how I first tackled the data after initial cleanup. For each categorical variable, I was able to get the following:

(a) Bar chart showing counts of all categorical variables broken out by by gender
(b) Boxplots that showed income cut by both gender for every categorical variable, with counts plotted at the mean. The counts on each box helped me to quickly validate whether the box I was looking at had enough data points to be considered valid.
(c) Dot plots with linear regressions for income vs all numeric variables cut by gender
(d) Combining insights from parts b and c. I facet wrapped several continuous variables by categorical variables to see the interaction effect.
(e) Relevel factors into new variables with broader buckets to reduce issues with small sample buckets.

This resulted in over 60 plots that I was able to visually analyze to understand the data. Refer to RMD code to see how I achieved to plot everything at once.

I will not show the output of all 60+ plots since it is quite extensive, but I will show select examples of my takeaways.

(a) Bar chart example

When discussing the income gender gap, many people think the difference is wholly attributable to the industries that men tend to work in have higher pay. Before we analyze that claim, let’s see the dsitribution of gender across industries:

# Categorical variables: bar chart counts
lapply("work.industry", FUN = function(iterate)
  ggplot(data=nlsy, aes_string(x="gender",  fill="gender")) +
       geom_bar(na.rm=TRUE, stat ="bin", position = "dodge") + facet_wrap(as.formula(paste("~", iterate))) +
       scale_fill_manual(values=c("#79a6c8", "#f67f7f")) +
       ggtitle(paste(iterate," (NA count = ", sum(is.na(nlsy[[iterate]])), ")", sep = ""))
       )

Work industry does seem to be split unevenly among gender..

The stark differences seen above may or may not be useful in our model. We also may need to summarize this variable into broader buckets during modelling.

(b) Box plot example

An example of an interesting categorical variable was marital status. The variable shown below is the simplified version of the marital status variable, which indidcates if the respondent was Never Married vs. Married/Widowed/Divorced/etc. (note: Numbers shown in the boxes are the count of data points, located at the mean)

# this function returns the count of each boxplot, and places the count at the mean on the plot
# I found this function on stackoverflow
# citation: http://stackoverflow.com/questions/3483203/create-a-boxplot-in-r-that-labels-a-box-with-the-sample-size-n
count.n <- function(x){
   return(c(y = mean(x), label = length(x)))
}

# Categorical variables: boxplots
lapply("ever.married", FUN = function(iterate)
  ggplot(data=nlsy, aes_string(x="gender", y="income",  fill="gender")) +
       geom_boxplot(na.rm=TRUE) + facet_wrap(as.formula(paste("~", iterate))) +
       scale_fill_manual(values=c("#79a6c8", "#f67f7f")) +
       ggtitle(paste(iterate," (NA count = ", sum(is.na(nlsy[[iterate]])), ")", sep = "")) +
       stat_summary(fun.data = count.n, geom = "text")
       )

Income seemed to change significantly based on the interaction of gender and marital status.

(c) Dot plot with linear fitted line example

An example of a numeric variable that caught my eye was household size, particularly for members under age of 18.

# Numeric variables: dot plot with linear regression
lapply("household.mem.under.18", FUN = function(iterate) ggplot(data=nlsy, aes_string(x=iterate, y="income", colour="gender")) +
       scale_colour_manual(values=c("#79a6c8", "#f67f7f")) +
       geom_point(na.rm=TRUE) +
       geom_smooth(method='lm',formula=y~x) +
       ggtitle(paste(iterate," (NA count = ", sum(is.na(nlsy[[iterate]])), ")", sep = ""))
       )

Females showed drastically lower incomes and men showed slighlty higher incomes for larger numbers of household members under the age of 18.

(d) Dot plot facet wrapped with categorical variable example

The following graphs show household members under 18 against income, cut by gender (color), and split by “never married”" vs “Married”. Interestingly, it looks like males are virtually unaffected by increased household size if they are married. Females that never married seemed to fare worse than marrried females.

# Numeric variables: dot plot with  linear regression
lapply("household.mem.under.18", FUN = function(iterate) ggplot(data=nlsy, aes_string(x=iterate, y="income", colour="gender")) +
       scale_colour_manual(values=c("#79a6c8", "#f67f7f")) +
       geom_point(na.rm=TRUE) +
       geom_smooth(method='lm',formula=y~x) +
       facet_wrap(~ever.married) +
       ggtitle(paste(iterate," (NA count = ", sum(is.na(nlsy[[iterate]])), ")", sep = ""))
       )

(e) Releveled Factors

I expected the highest.degree.asof.2011 variable to be very telling, but surprisingly it didn’t seem to useful in explaining the gender gap. I suspected it had too many levels, so I created another variable with less factors called degree.post.hs with Yes or No values. Below you can see the the difference in the gender gap is more apparent for those respondents that did not receive a degree post high school:

During regression modelling, I also found that the race variable was also creating problems when combined with other variables. However, I felt that all the variables I had were useful, but that there was not enough data to satisfy every race break for every other factor. Therefore, I releveled race into it’s two most significant buckets, “Black” and “NonBlack”, in a variable called raceblack:

The gender gap difference here is much more apparent with large samples in each bucket, and therefore more useful for modelling purposes.

In total, variables that looked significant to me were the following:

Categorical:

  • gender
  • race
  • race.black
  • work.industry.food
  • smoked.since.last.interview
  • degree.post.hs
  • ever.married

Numeric:

  • household.mem.under.18
  • income.family
  • income.spouse
  • net.worth.from.parent.1997
  • net.worth.from.r.2003
  • incar.num

Let’s add them to vectors for easy use:

var.factor.model <- c("gender", "race", "race.black", "work.industry.food", "smoked.since.last.interview", "degree.post.hs","ever.married")
var.numeric.model <- c("household.mem.under.18", "income.family", "income.spouse", "net.worth.from.parent.1997", "net.worth.from.r.2003", "incar.num")

In the next section, we will use statistical analysis to see if we can make any findings.

3. Findings

Collinearity

The first thing I want to do is check for collinearity among our variables.

# Function taken from ?pairs Example section.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y, use='pairwise')) # enabled pairwise option
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = pmax(1, cex.cor * r))
}

# Use panel.cor to display correlations in lower panel.
pairs(nlsy[ , var.numeric.model], lower.panel = panel.cor)

Spousal income and family income are highly correlated. Based on prior research I am choosing to remove spousal income.

# remove element
var.numeric.model <- var.numeric.model[! var.numeric.model %in% c("income.spouse")]

Continuous Variable Significance Across Gender

Next, I want to run a data summary broken by gender to make sure to validate that the variables are statistically significant across gender:

# generate summary of numeric variables
kable(generateDataSummary(nlsy, var.numeric.model, "gender"), format="markdown")
Variable Missing Male Female P-Value
household.mem.under.18 12 0.83 (1.16) 1.2 (1.29) 0
income.family 568 61236.25 (37158.02) 63903.92 (39937.76) 0.0192
net.worth.from.parent.1997 1284 95762.03 (137133.39) 96533.08 (140889.49) 0.8627
net.worth.from.r.2003 3916 10409.39 (27254.76) 12105.33 (41010.21) 0.3948
incar.num 0 0.18 (0.65) 0.04 (0.29) 0

Unforturnately, it looks like I need to throw out both net worth variables from my analysis. The P-values are greater than our threshold of 0.05, so we should not use them in the regression analysis.

I also want to remove income.family since it has too many missing values. Even if income.family had enough data, I probably would have needed to remove it anyway. Originally, I thought the variable indicated family income at childhood, but since the variable is from 2011 data it is more likely that it is simply the sum of the respondent’s income plus other household members’ income, which is not very useful. It is probably too closely related to the independent variable.

Let’s remove them:

# remove element
var.numeric.model <- var.numeric.model[! var.numeric.model %in% c("net.worth.from.r.2003", "net.worth.from.parent.1997", "income.family", "income.spouse")]

Regression

Now it’s time to start our regression. First I want to cut down the dataset for modelling so I can omit any row with NA:

# cut down to relevant columns
nlsy.trim <- nlsy[ , c("income",var.numeric.model, var.factor.model)]

# cut down to non-NA rows
nlsy.trim <- na.omit(nlsy.trim)

Final regression model observation count excluding all NAs: 5170

Let’s create an initial linear model using just the gender variable:

all.lm00 <- lm(income ~ gender, data = nlsy.trim)
kable(round(summary(all.lm00)$coef, 3), format = "markdown")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34326.285 372.019 92.270 0
genderFemale -5702.005 537.571 -10.607 0

Looks good so far. We have a P-value of 0 for gender and a coefficient of -5702.005, which represents the income gap ignoring all other possible factors.

Now let’s add more variables to see if we can explain the income gap.

# add race
all.lm01 <- update(all.lm00, . ~ . + race)
anova(all.lm00, all.lm01)
## Analysis of Variance Table
##
## Model 1: income ~ gender
## Model 2: income ~ gender + race
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)
## 1   5168 1.9269e+12
## 2   5165 1.8837e+12  3 4.3185e+10 39.471 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Good, the ANOVA test shows that the addition of race makes our model significantly differenent. I used ANOVA throughout my iterations of models but I will not show every output.

Let’s continue adding variables:

# add marriage
all.lm02 <- update(all.lm01, . ~ . + ever.married)
#anova(all.lm01, all.lm02)
kable(round(summary(all.lm02)$coef, 3), format = "markdown")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33884.110 509.573 66.495 0.000
genderFemale -5843.250 528.158 -11.063 0.000
raceBlack -5741.965 668.197 -8.593 0.000
raceHispanic -3322.324 665.224 -4.994 0.000
raceMixed 1271.393 2725.161 0.467 0.641
ever.marriedMarried 5466.192 537.475 10.170 0.000

Unfortunately we have an insignificant varibale for raceMixed, since it has a P-value of 0.641. This is probably due to small sample size in that bucket. Let’s switch race out with the simpler variable race.black:

#remove race and add race.black
all.lm03 <- update(all.lm02, . ~ . - race + race.black)
kable(round(anova(all.lm02, all.lm03), 3))
Res.Df RSS Df Sum of Sq F Pr(>F)
5164 1.846691e+12 NA NA NA NA
5166 1.855852e+12 -2 -9161153812 12.809 0
kable(round(summary(all.lm03)$coef, 3), format = "markdown")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32901.720 467.186 70.425 0
genderFemale -5850.355 529.349 -11.052 0
ever.marriedMarried 5571.017 538.272 10.350 0
race.blackBlack -4787.516 639.176 -7.490 0

Looks great! Let’s keep iterating that process:

# add degree.post.hs
all.lm04 <- update(all.lm03, . ~ . + degree.post.hs)
#summary(all.lm04)

# add work.industry.food
all.lm05 <- update(all.lm04, . ~ . + work.industry.food)
#summary(all.lm05)

# add smoked.since.last.interview
all.lm06 <- update(all.lm05, . ~ . + smoked.since.last.interview)
kable(round(summary(all.lm06)$coef, 3), format = "markdown")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28422.942 548.577 51.812 0.000
genderFemale -7507.578 503.518 -14.910 0.000
ever.marriedMarried 4776.841 508.933 9.386 0.000
race.blackBlack -3773.635 606.201 -6.225 0.000
degree.post.hsYes 11346.478 527.057 21.528 0.000
work.industry.foodYes -8086.750 805.678 -10.037 0.000
smoked.since.last.interviewNo 3064.295 533.041 5.749 0.000
smoked.since.last.interview? -3914.918 3171.527 -1.234 0.217

The variable smoked.since.last.interview has missing variables (indicated by “?”) that are not significant, since it has P-value of 0.217. Let’s remove it and see if household.mem.under.18 and incar.num are useful:

# add household.mem.under.18 and remove smoked.since.last.interview
all.lm07 <- update(all.lm06, . ~ . + household.mem.under.18 - smoked.since.last.interview)
#kable(round(summary(all.lm07)$coef, 3), format = "markdown")
#turned out to be good

# add household.mem.under.18 and remove smoked.since.last.interview
all.lm07 <- update(all.lm07, . ~ . + incar.num)
kable(round(summary(all.lm07)$coef, 3), format = "markdown")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31367.925 515.848 60.808 0
genderFemale -7310.443 512.138 -14.274 0
ever.marriedMarried 5775.241 532.558 10.844 0
race.blackBlack -3145.910 604.181 -5.207 0
degree.post.hsYes 10723.261 542.107 19.781 0
work.industry.foodYes -8080.518 802.341 -10.071 0
household.mem.under.18 -1073.582 221.236 -4.853 0
incar.num -3517.142 490.297 -7.173 0

Now let’s add some interaction variables with gender to see if we can find cases where income changes across these variables based on gender.

# add interaction var for race.black and gender
all.lm08 <- update(all.lm07, . ~ . + race.black*gender)
#summary(all.lm08)

# add interaction var for marriage and gender
all.lm09 <- update(all.lm08, . ~ . + ever.married*gender)
#summary(all.lm09)

# add interaction var for household.mem.under.18 and gender
all.lm10 <- update(all.lm09, . ~ . + household.mem.under.18*gender)
#summary(all.lm10)

# add interaction var for education and gender
all.lm11 <- update(all.lm10, . ~ . + degree.post.hs*gender)
kable(round(summary(all.lm11)$coef, 3), format = "markdown")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29878.862 607.697 49.167 0.000
genderFemale -3434.553 957.322 -3.588 0.000
ever.marriedMarried 8194.194 741.693 11.048 0.000
race.blackBlack -5165.663 838.081 -6.164 0.000
degree.post.hsYes 9318.626 754.018 12.359 0.000
work.industry.foodYes -7585.502 794.284 -9.550 0.000
household.mem.under.18 465.335 322.035 1.445 0.149
incar.num -3659.436 485.759 -7.533 0.000
genderFemale:race.blackBlack 4035.707 1199.203 3.365 0.001
genderFemale:ever.marriedMarried -5346.092 1054.818 -5.068 0.000
genderFemale:household.mem.under.18 -3055.363 441.831 -6.915 0.000
genderFemale:degree.post.hsYes 2120.623 1065.489 1.990 0.047

All of the interactions added turned out to be significant, with the degree.post.hs and gender interaction variable having the highest P-value of of 0.047, which is still acceptable for our alpha of 0.05.

The P-value of 0.149 for household.mem.under.18 is acceptable only because the interaction P-value of that variable with gender is significant, with a P-value of 0. We must keep the “low order” base variable for the interaction variable to be present. In any case, the coefficient weight on the non-interaction household.mem.under.18 is low, at only $465

We will use this as our final model.

# pick best model
final.lm <- all.lm11

Plot Tests

Now let’s check the model using standard Plot Tests

# Plot
plot(final.lm)

  • The residual vs fitted plot did not turn out so well, because there is definitely a trend in the data.
  • The Q-Q plot is also showing that the residual distribution is not quite normal, as we would like it to be.
  • The residuals vs leverage plot labeled a few outliers that could be skewing our regression

This probably is not an appropriate model for income.

However, our objective is not necessarily to create a perfect model to predict income. We really are just interested in explaining the income gender gap, which is best shown by our interation variables with gender. We’ll discuss this in the next section.

4. Discussion

R-squared and coefficient analysis

Now let’s display the summary of the model to take a look at the R-squared and the coefficients:

# get model summary
final.lm.summary <- summary(final.lm)

# print summary
kable(round(final.lm.summary$coef, 3), format = "markdown")
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29878.862 607.697 49.167 0.000
genderFemale -3434.553 957.322 -3.588 0.000
ever.marriedMarried 8194.194 741.693 11.048 0.000
race.blackBlack -5165.663 838.081 -6.164 0.000
degree.post.hsYes 9318.626 754.018 12.359 0.000
work.industry.foodYes -7585.502 794.284 -9.550 0.000
household.mem.under.18 465.335 322.035 1.445 0.149
incar.num -3659.436 485.759 -7.533 0.000
genderFemale:race.blackBlack 4035.707 1199.203 3.365 0.001
genderFemale:ever.marriedMarried -5346.092 1054.818 -5.068 0.000
genderFemale:household.mem.under.18 -3055.363 441.831 -6.915 0.000
genderFemale:degree.post.hsYes 2120.623 1065.489 1.990 0.047

This model has many significant variables (P-values below 0.05), but also a very low r.squared value of 19.95%, indicating that the model is only explaining 19.95% of the income variance.

What’s very interesting the the high significance on the gender interation variables and their respective coefficients. For example, although females make $3435 less than males in general, they make even less if they were married, an additional loss of $5346!

Black females are not penalized compared to black males, making $ 4036 more than black males, but blacks overall make $5166 less than other races.

Also, for each child that a female has, she makes $3055 less per child. And females that have a degree post High School tend to make $2121 more than those without the degree.

The insights that I gained show that gender certainly has an impact on income. I am confident that this analysis shows some significiant directional relationships between gender and income, but I am not very confident in the model's predictive power, and even less confident about drawing causal conclusions from the analysis. This study would need to have more data and controlled testing to distinguish causality vs correlation.