Regression Models
This page is a part of Economic Statistics (ECON304) class taught at Faculty of Economics, Chiang Mai University in Thailand.
This site aim to provide the explanation of Regression Model using R programming.
Students are expected to be able to:
- Compute covariance of two variables
- Compute correlation coefficient of two variables
- Visualise the data of variable
- Conduct t-test for mean testing
- Conduct a simple regression model
- Develop a better model using multiple regression
The followings are the code:
1. Get it started
# Regression model # 1. Install package "AER" install.packages("AER") library(AER) # 2. load the data from package = CPS1985 data(CPS1985)
2. Descriptive Analysis of Data
# 3. Have a look at the CPS data names(CPS1985) # call the variable names str(CPS1985) # show the structure of the variables summary(CPS1985) # show the min qualities and max value ?CPS1985
# in CPS data, # the potential dependent variable is "wage" # let's explore the wage # if wage is our dependent variable, # then we are interested in variation of wage # Calculate stats representing variation var(CPS1985$wage) # calculate variance sd(CPS1985$wage) # calculate standard deviation # Data viz of wage boxplot(CPS1985$wage) # boxplot of wage boxplot(CPS1985$wage, main = "A boxplot of the wage in 1985", col = "blueviolet" ) hist(CPS1985$wage, col = "blueviolet", main = "Histogram of wage in 1985") barplot(CPS1985$wage, col = "blueviolet") # now we will construct the regression model # we have dependent variable as "wage" # so what should it be the independent variable? # let's try with experience # so we shall visualise the relationship plot(CPS1985$experience, CPS1985$wage, main = "Scatter plot between wage and experience", ylab = "Wage (USD/hour)", xlab = "Experience ( years of potential work experience )" , col = "Blue") # plot x&y cov(CPS1985$experience, CPS1985$wage) # covariance cor(CPS1985$experience, CPS1985$wage) # correlation boxplot(CPS1985$experience)3. Regression Model Development
# regression model will look like: # y = a + b*x # E(wage) = alpha + beta*experience model1 <- lm(wage ~ experience, data = CPS1985) # create model1 summary(model1) # E(wage) = 8.37997 + 0.03614*experience # R=square = 0.00757let's interpret the model
- Alpha (Intercept)
alpha = 8.380
alpha is the value of Y when X is 0.
When a labour has no experience, the expected wage will be 8.380 USD/hour- Beta (Slope)
beta = 0.0361
bata show how much Y will change when X increase by 1 unit
When labours have 1 year more experience,
They will get 0.0361 USD/hour increase in their wage- R square
R-square = 0.00757
R-square means the proportion of variation in Y. That can be explained by the predictor(s) in the model. Experience can explain 0.757% of variation in wage. The rest of 99.243% can be explained by other variable###################
4. Variable transformation
# Now we will try to fix the problem # start with normality # wage was found to be positively skew install.packages("psych") library(psych) skew(CPS1985$wage) # calculate skewness # so we have to inform autobot to reduce the size of wage # Bubble bee suggested that we can use square root # so we will create a new variable # that is the square root of wage CPS1985$sqrt_wage <- sqrt(CPS1985$wage) boxplot(CPS1985$sqrt_wage, main = "square root of wage ", col = "blue") boxplot(CPS1985$wage, main = "wage", col = "red") skew(CPS1985$sqrt_wage) # sqrt still results the positive skew # so we use the higher level of transformer # that is log CPS1985$log_wage <- log(CPS1985$wage) boxplot(CPS1985$wage, main = "wage", col = "red") boxplot(CPS1985$sqrt_wage, main = "square root of wage ", col = "blue") boxplot(CPS1985$log_wage, main = "log of wage ", col = "magenta") skew(CPS1985$log_wage) boxplot(CPS1985$wage, CPS1985$sqrt_wage, CPS1985$log_wage) skew(CPS1985$experience) skew(sqrt(CPS1985$experience)) skew(log(CPS1985$experience)) # now we will construct the model using transform variable model2 <- lm(log(wage) ~ sqrt(experience), data = CPS1985) summary(model2) plot(log(CPS1985$wage), sqrt(CPS1985$experience)) ##############################################5. Multiple Regression Model Development
# Now from simple regression # We will develop to multiple regression # The question is that # will the inclusion of education make the model better boxplot(CPS1985$education) skew(CPS1985$education) boxplot((CPS1985$education)^2) skew((CPS1985$education)^2) # so we use just education # now we gonna add education into the model2 # we then create model3 by add education to model2 model3 <- lm(log(wage) ~ sqrt(experience) + education, data = CPS1985) summary(model3) # log(wage) = 0.394 + 0.105*sqrt(exp) + 0.09*edu # R-square is 23.44% # to statistically test if the model3 is better than model2 # we use ANOVA test anova(model2, model3) #### # next add the age boxplot(CPS1985$age) skew(CPS1985$age) # =0.5452207 skew(sqrt(CPS1985$age)) # = 0.2817645 skew(log(CPS1985$age)) # 0.003625903 boxplot(log(CPS1985$age)) model4 <- lm(log(wage) ~ sqrt(experience) + education + log(age), data = CPS1985) summary(model4) # R-squared: 0.2423 anova(model3, model4) # to test assumption plot(model4) # we gonna plot all 4 in the same time par(mfrow=c(2, 2)) # split window to 2x2 scale plot(log(CPS1985$wage), log(CPS1985$age)) model5 <- lm(log(wage) ~ log(age) + age, data = CPS1985) summary(model5) model6 <- lm(wage ~ education + experience + education*experience, data = CPS1985) summary(model6) model7 <- lm(wage ~ experience*age, data = CPS1985) summary(model7) #########################6. Output Visualisation
# finally we will viasualise the output CPS1985$sqrt_experience <- sqrt(CPS1985$experience) model2 <- lm(log_wage ~ sqrt_experience , data = CPS1985) summary(model2) plot(CPS1985$sqrt_experience, CPS1985$log_wage) abline(model2, col = "red")