Checking model assumptions and fit multiple linear regression

BIOSTATS 690C – Fall 2020 9. SUPPLEMENT: Stata for Normal Theory Regression - version 16 Page 1 of 26

Unit 9

I – Simple Linear Regression
1. Introduction to Example
2. Preliminaries: Descriptives
3. Model Fitting (Estimation)
4. Model Examination
5. Checking Model Assumptions and Fit

II – Multiple Linear Regression
1. Introduction to Example
2. Preliminaries: Descriptives
3. Handling of Categorical Predictors: Indicator Variables
4. Model Fitting (Estimation)
5. Checking Model Assumptions and Fit

Design → Data Collection → Data Management → Data Summarization → Statistical Analysis → Reporting

Setting:
Calls to the New York Auto Club are possibly related to the weather, with more calls occurring during bad weather. This example illustrates descriptive analyses and simple linear regression to explore this hypothesis in a data set containing information on calendar day, weather, and numbers of calls.

Stata Data Set:
ers.dta
In this illustration, the data set ers.dta is accessed from the BIOSTATS 690C course website directly. It is then saved to your current working directory.

. ***** Save the inputted data to the directory you have chosen above
. ***** Command is save "NAME", replace
. save "ers.dta", replace
(note: file ers.dta not found)
file ers.dta saved


. * Describe data set
. codebook, compact

Variable Obs Unique Mean Min Max Label

   stats |       low     calls
---------+--------------------
       N |        28        28
    mean |     21.75   4318.75
      sd |  13.27383  2692.564
     min |        -2      1674
     max |        41      8947
------------------------------
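The command that produced the summary table above did not survive in this copy. A table of this shape is what `tabstat` prints, so a minimal sketch, assuming ers.dta is in the current working directory as saved earlier, might look like this:

```stata
* Assumes ers.dta is in the current working directory (saved earlier)
use "ers.dta", clear
* N, mean, standard deviation, minimum and maximum of low and calls
tabstat low calls, statistics(n mean sd min max)
```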


    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
       calls |         28    0.82916      5.159     3.378    0.00037

The null hypothesis of normality of Y=calls is rejected by the Shapiro-Wilk test (p-value = .00037). Tip: sometimes the cure is worse than the original violation. For now, we’ll charge on.
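For reference, output in the W, V, z, Prob>z layout is what Stata's `swilk` command prints. The sketch below assumes ers.dta is in the current working directory. The `qnorm` plot is an addition here, offered as a visual companion to the test, not something taken from the original notes.

```stata
* Assumes ers.dta is in the current working directory (saved earlier)
use "ers.dta", clear
* Shapiro-Wilk test of the null hypothesis that calls is normally distributed
swilk calls
* Normal quantile plot: points far from the diagonal suggest non-normality
qnorm calls
```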


Remarks

• The fitted line performs better in explaining variability in calls than does $\bar{Y}$ = average # calls.
• From this output, the analysis of variance is the following:

$$\text{MSS} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 100{,}233{,}719$$

$$\text{MSS}/1 = 100{,}233{,}719 \quad \text{(model mean square, 1 degree of freedom)}$$

$$\text{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 195{,}747{,}315$$
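The two sums of squares reported above are enough to recover the rest of the analysis of variance by hand. The sketch below uses only those two values and n = 28 from the descriptives. The scalar names are arbitrary choices for illustration. It gives an R-squared of about 0.51.

```stata
* Sums of squares reported in the analysis of variance above
scalar mss = 100233719
scalar tss = 195747315
* Sample size from the descriptives (N = 28)
scalar nobs = 28
* Residual sum of squares by subtraction, with n - 2 degrees of freedom
scalar rss = tss - mss
scalar df_res = nobs - 2
* R-squared: proportion of the total variability explained by the fit
scalar r2 = mss / tss
* Overall F statistic: model mean square over residual mean square
scalar fstat = (mss / 1) / (rss / df_res)
display "RSS       = " %12.0fc rss
display "R-squared = " %6.4f r2
display "F         = " %6.2f fstat
```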


4. Model Examination

. * Scatterplot with overlay fit and overlay 95% confidence band
. * Tip! – Because of layering: confidence interval first, then fit, then data points
. ***** graph twoway (lfitci YVARIABLE XVARIABLE) (lfit YVARIABLE XVARIABLE) (scatter YVARIABLE XVARIABLE, symbol(d)), title("TITLE") subtitle("TITLE")
. graph twoway (lfitci calls low) (lfit calls low) (scatter calls low, symbol(d)), title("Calls to NY Auto Club 1993-1994") subtitle("95% Confidence Bands")

Remarks
• The overlay of the straight line fit is reasonable but substantial variability is

Not bad actually!


• For straight line regression, the suggestion is to regard Cook’s Distance values > 1 as significant.
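As a sketch of how the Cook's Distance rule of thumb can be applied in practice, the block below refits the straight line model and lists any observation above the cutoff of 1. It assumes ers.dta is in the current working directory. The variable name cook_d is a hypothetical choice.

```stata
* Assumes ers.dta is in the current working directory (saved earlier)
use "ers.dta", clear
* Refit the straight line model of calls on low
regress calls low
* Cook's distance for each observation (cook_d is an arbitrary name)
predict cook_d, cooksd
* List observations exceeding the suggested cutoff of 1
list low calls cook_d if cook_d > 1 & !missing(cook_d)
```

If the list comes back empty, no single day exceeds the cutoff.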


Remarks

•	deleting that individual from the analysis. Departures of this plot from a parallel band about the horizontal line at zero are


II – Multiple Linear Regression

. ***** Just to be safe! save the ers.dta data again
. save "ers.dta", replace
file ers.dta saved

. ***** Clear the workspace
. ***** Command is clear
. clear


Data are complete; n=67 for every variable. Y=p53 has a limited range, so the assumption of normality is a bit dicey, but we’ll proceed anyway. Current age (agecurr) ranges from 15 to 75.

. *
. ***** Pairwise correlations for all the variables
. pwcorr p53 pregnum agefirst agecurr menop, star(0.05) sig


This does not look linear. So we will create dummies for age at 1st pregnancy.
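One way the indicator (dummy) variables might be built is sketched below on a small toy data set, so that the block runs on its own. The numeric coding of agefirst (0 = never pregnant, 1 = age le 24, 2 = age > 24) is an assumption made for illustration. The actual coding in the course data may differ.

```stata
clear
* Toy data with a hypothetical coding of agefirst:
* 0 = never pregnant, 1 = age le 24, 2 = age > 24 (assumed, not from the source)
input agefirst
0
1
2
1
2
end
* Each indicator is 1 when the condition holds, 0 otherwise,
* and stays missing when agefirst is missing
generate early = (agefirst == 1) if !missing(agefirst)
generate late  = (agefirst == 2) if !missing(agefirst)
* Cross-tabulate against the source variable to confirm the definitions
tab2 agefirst early
tab2 agefirst late
```

With both indicators in a model, "never pregnant" (early = 0 and late = 0) serves as the reference group.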


. tab2 agefirst early
-> tabulation of agefirst by early

    Age at 1st |         early
     Pregnancy |         0          1 |     Total
---------------+----------------------+----------
never pregnant |        16          0 |        16
     age le 24 |         0         32 |        32
      age > 24 |        19          0 |        19
---------------+----------------------+----------
         Total |        35         32 |        67

Ditto. The new variable, late, is well defined.

. label variable early "Age le 24"
. label variable late "Age gt 24"


. * -----------------------------------------------------------------------------------
. * Model Estimation Set I: Determination of best model in the predictors of interest.
. * Goal is to obtain best parameterization before considering covariates.
. * -----------------------------------------------------------------------------------

NOTE!! We see a consequence of the multicollinearity of our predictors [early, late] and pregnum: early and late have NON-significant t-statistic p-values, and pregnum has a t-statistic p-value that is only marginally significant.

.*
.***** 2 df Partial F-test ( Null: [early, late] are not significant, controlling for pregnum).


Marginally statistically significant (p-value = .0656): the null hypothesis is rejected at the 0.10 level but not at the conventional 0.05 level. Conclude that, in the model that contains [early, late], pregnum is marginally statistically significantly associated with Y=p53.
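The partial F-tests discussed in this section can be requested with Stata's `test` command after fitting the maximal model. This is a sketch, assuming the p53 data with variables p53, pregnum, early and late are already in memory. The data set's file name is not given in this copy, so no `use` line is shown.

```stata
* Assumes the p53 data (p53, pregnum, early, late) are already in memory
* Maximal model in the predictors of interest
regress p53 pregnum early late
* 2 df partial F-test of [early, late], controlling for pregnum
test early late
* 1 df partial F-test of pregnum, controlling for [early, late]
test pregnum
```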

.*
.***** Save results from model above to “model1” for tabulation later.

------------------------------------------------------------------------------
         p53 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pregnum |   .4152523   .1045572     3.97   0.000     .2064372    .6240675
       _cons |   2.563537   .2087239    12.28   0.000     2.146687    2.980388
------------------------------------------------------------------------------
The fitted line is p53 = 2.56 + (0.42)*pregnum.

19.5% of the variability in Y=p53 is explained by this model (R-squared = .1953).
This model is statistically significantly more explanatory than the null model (p-value = .0002).
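The table entries can be checked by hand from the reported coefficients. The sketch below reproduces the t statistic for pregnum, converts it to the overall F-test p-value (1 and 67 - 2 = 65 degrees of freedom), and computes a predicted p53. The value pregnum = 2 is a hypothetical input chosen only for illustration.

```stata
* Estimates copied from the regression table above
scalar b0    = 2.563537
scalar b1    = .4152523
scalar se_b1 = .1045572
* t statistic = coefficient / standard error
scalar t_b1 = b1 / se_b1
* In a one-predictor model the overall F equals t squared, with 1 and n-2 df
scalar p_f = Ftail(1, 65, t_b1^2)
* Predicted p53 at a hypothetical pregnum of 2
scalar yhat = b0 + b1 * 2
display "t for pregnum         = " %5.2f t_b1
display "p-value of overall F  = " %6.4f p_f
display "predicted p53 (x = 2) = " %5.2f yhat
```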


. *
. ***** Regression of Y=p53 on design variables [early, late] only. pregnum dropped.

. eststo model3

.*
.***** SUMMARY of Model Estimation Set I.

Choose model “(2)” as a good “minimally adequate” model: Y=p53 and X=pregnum. This is why.

(1) Model “(1)” is the maximal model. R-squared = .20
(2) Model “(2)” drops [early,late]. R-squared is minimally lower: R-squared = .195
(3) Model “(3)” drops pregnum. R-squared drop is more substantial: R-squared = .159
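A side-by-side table of the three stored models is one way to see these R-squared values together. This sketch assumes the p53 data are in memory and that the user-written estout package (which supplies `eststo` and `esttab`) is installed. The three model specifications follow the descriptions above.

```stata
* Assumes the p53 data (p53, pregnum, early, late) are already in memory
* and that the user-written estout package is installed (ssc install estout)
eststo clear
* Model 1: maximal model in the predictors of interest
eststo model1: quietly regress p53 pregnum early late
* Model 2: drops [early, late]
eststo model2: quietly regress p53 pregnum
* Model 3: drops pregnum
eststo model3: quietly regress p53 early late
* Coefficients with standard errors, plus R-squared for each model
esttab model1 model2 model3, se r2
```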
