Checking model assumptions and fit multiple linear regression

BIOSTATS 690C – Fall 2020 9. SUPPLEMENT: Stata for Normal Theory Regression - version 16 Page 1 of 26

Unit 9

I – Simple Linear Regression
   1. Introduction to Example
   2. Preliminaries: Descriptives
   3. Model Fitting (Estimation)
   4. Model Examination
   5. Checking Model Assumptions and Fit

II – Multiple Linear Regression
   1. Introduction to Example
   2. Preliminaries: Descriptives
   3. Handling of Categorical Predictors: Indicator Variables
   4. Model Fitting (Estimation)
   5. Checking Model Assumptions and Fit

[Figure: workflow diagram with boxes labeled Design, Data Collection, Data Management, Data Summarization, Statistical Analysis, and Reporting]

Setting:
Calls to the New York Auto Club are possibly related to the weather, with more calls occurring during bad weather. This example illustrates descriptive analyses and simple linear regression to explore this hypothesis in a data set containing information on calendar day, weather, and numbers of calls.

Stata Data Set:
ers.dta
In this illustration, the data set ers.dta is accessed from the BIOSTATS 690C course website directly. It is then saved to your current working directory.

. ***** Save the inputted data to the directory you have chosen above
. ***** Command is save "NAME", replace
. save "ers.dta", replace
(note: file ers.dta not found)
file ers.dta saved
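
The command that reads ers.dta from the course website is not shown above. A minimal sketch of the load-then-save step follows; the web address is a hypothetical placeholder and must be replaced with the actual course location.

```stata
* Hypothetical URL: substitute the real location of ers.dta on the course website
use "https://example.edu/biostats690c/ers.dta", clear

* Show the current working directory, then save a local copy there
pwd
save "ers.dta", replace
```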


. * Describe data set
. codebook, compact

Variable Obs Unique Mean Min Max Label

   stats |       low     calls
---------+--------------------
       N |        28        28
    mean |     21.75   4318.75
      sd |  13.27383  2692.564
     min |        -2      1674
     max |        41      8947
------------------------------
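
The table above has the layout that tabstat produces, but the command itself did not survive in this excerpt. The following is a sketch of one call that would give these rows and columns, assuming ers.dta is in memory.

```stata
* Assumed command (not shown in the original log):
* N, mean, sd, min and max of low and calls
tabstat low calls, statistics(n mean sd min max)
```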

    Variable |        Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
       calls |         28    0.82916      5.159     3.378    0.00037

The null hypothesis of normality of Y=calls is rejected (p-value = .00037). Tip: sometimes the cure is worse than the original violation. For now, we’ll charge on.
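
The column headings in the output above (W, V, z, Prob>z) match Stata's Shapiro-Wilk test. A minimal sketch of that check follows; the normal quantile plot is an optional visual companion added here, not part of the original log.

```stata
* Shapiro-Wilk test; null hypothesis: calls is normally distributed
swilk calls

* Optional addition: normal quantile plot of calls
qnorm calls
```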


Remarks

… performs better in explaining variability in calls than does Ȳ = average number of calls.
From this output, the analysis of variance is the following:

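Source (model df = 1):

$$\mathrm{MSS} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 100{,}233{,}719, \qquad \mathrm{MSS}/1 = 100{,}233{,}719$$

$$\mathrm{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 95{,}513{,}596.2$$

$$\mathrm{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 195{,}747{,}315$$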
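
The three sums of squares above determine R-squared and the overall F statistic. The arithmetic can be sketched with Stata scalars. The residual degrees of freedom, 26, is an assumption derived from n = 28 observations and two estimated parameters (intercept and slope), and the scalar names are illustrative.

```stata
* Sums of squares copied from the analysis of variance above
scalar mss = 100233719
scalar rss = 95513596.2
scalar tss = 195747315

* Check that TSS = MSS + RSS (up to rounding in the reported values)
display "MSS + RSS = " %15.1f (mss + rss)

* R-squared = MSS / TSS, about 0.51 with these values
display "R-squared = " %6.4f (mss/tss)

* Overall F = (MSS/1) / (RSS/26); residual df of 26 assumes n = 28
scalar fstat = (mss/1) / (rss/26)
display "F(1, 26)  = " %8.2f fstat
display "p-value   = " %8.6f Ftail(1, 26, fstat)
```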

4. Model Examination
.

. * Scatterplot with overlay fit and overlay 95% confidence band
. * Tip! Layers are drawn in the order listed: confidence band first, then fit, then data points
. ***** graph twoway (lfitci YVARIABLE XVARIABLE) (lfit YVARIABLE XVARIABLE) (scatter YVARIABLE XVARIABLE, symbol(d)), title("TITLE") subtitle("TITLE")
. graph twoway (lfitci calls low) (lfit calls low) (scatter calls low, symbol(d)), title("Calls to NY Auto Club 1993-1994") subtitle("95% Confidence Bands")

Remarks

The overlay of the straight line fit is reasonable but substantial variability is

Not bad actually!

Remarks

For straight line regression, the suggestion is to regard Cook’s Distance values > 1 as significant.
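
A minimal sketch of how Cook's Distance values could be obtained and screened against the cutoff of 1, assuming the straight line regression of calls on low. The variable names cook and obsno are illustrative, not taken from the original.

```stata
* Refit the straight line model so that predict has estimation results to use
regress calls low

* Cook's Distance for each observation (new variable name is illustrative)
predict cook, cooksd

* Observation number, for identifying flagged rows (illustrative name)
generate obsno = _n

* List any observations exceeding the suggested cutoff of 1
list obsno low calls cook if cook > 1 & !missing(cook)
```

If the list comes back empty, no observation exceeds the cutoff.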


Remarks

deleting that individual from the analysis.

Departures of this plot from a parallel band about the horizontal line at zero are


II – Multiple Linear Regression

. ***** Just to be safe! save the ers.dta data again
. save "ers.dta", replace
file ers.dta saved

. ***** Clear the workspace
. ***** Command is clear
. clear


Data are complete; n=67 for every variable. Y=p53 has a limited range, so the assumption of normality is a bit dicey, but we’ll proceed anyway. Current age (agecurr) ranges from 15 to 75.

. *
. ***** Pairwise correlations for all the variables
. pwcorr p53 pregnum agefirst agecurr menop, star(0.05) sig


This does not look linear. So we will create dummies for age at 1st pregnancy.
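
One way to build the indicator variables is sketched below. The numeric coding of agefirst is not shown in this excerpt, so the codes 1 (age le 24) and 2 (age > 24) are assumptions; check the value labels first and adjust the codes to match.

```stata
* Inspect how agefirst is coded before relying on the assumed codes below
codebook agefirst

* Assumed coding: 1 = age le 24, 2 = age > 24
* (never pregnant is then the reference group)
generate early = (agefirst == 1) if !missing(agefirst)
generate late  = (agefirst == 2) if !missing(agefirst)

* Verify each indicator against the original variable
tab2 agefirst early
tab2 agefirst late
```

The `if !missing(agefirst)` qualifier keeps a missing agefirst from being silently coded as 0.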


. tab2 agefirst early
-> tabulation of agefirst by early

    Age at 1st |         early
     Pregnancy |         0          1 |     Total
---------------+----------------------+----------
never pregnant |        16          0 |        16
     age le 24 |         0         32 |        32
      age > 24 |        19          0 |        19
---------------+----------------------+----------
         Total |        35         32 |        67

Ditto. The new variable, late, is well defined.

. label variable early "Age le 24"
. label variable late "Age gt 24"


. * -----------------------------------------------------------------------------------
. * Model Estimation Set I: Determination of best model in the predictors of interest.
. * Goal is to obtain best parameterization before considering covariates.
. *-------------------------------------------------------------------------------------

NOTE!! We see a consequence of the multicollinearity of our predictors [early, late] and pregnum: early and late have NON-significant t-statistic p-values, and pregnum has a t-statistic p-value that is only marginally significant.

.*
.***** 2 df Partial F-test ( Null: [early, late] are not significant, controlling for pregnum).
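
A 2 df partial F-test of this kind can be sketched as follows, assuming the maximal model regresses p53 on pregnum, early, and late (the regress command itself is not shown in this excerpt).

```stata
* Maximal model: p53 on pregnum plus the design variables early and late
regress p53 pregnum early late

* Joint test that the coefficients of early and late are both zero
* (2 numerator degrees of freedom), controlling for pregnum
test early late
```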


Marginally statistically significant (p-value = .0656): the null hypothesis is rejected at the 0.10 level but not at the 0.05 level. Conclude that, in the model that contains [early, late], pregnum is marginally significantly associated with Y=p53.

.*
.***** Save results from model above to “model1” for tabulation later.

------------------------------------------------------------------------------
         p53 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pregnum |   .4152523   .1045572     3.97   0.000     .2064372    .6240675
       _cons |   2.563537   .2087239    12.28   0.000     2.146687    2.980388
------------------------------------------------------------------------------

The fitted line is p53 = 2.56 + (0.42)*pregnum.

19.5% of the variability in Y=p53 is explained by this model (R-squared = .1953).
This model is statistically significantly more explanatory than the null model (p-value = .0002).


. *
. ***** Regression of Y=p53 on design variables [early, late] only. pregnum dropped.

. eststo model3

.*
.***** SUMMARY of Model Estimation Set I.

Choose model “(2)” as a good “minimally adequate” model: Y=p53 and X=pregnum. This is why.

(1) Model “(1)” is the maximal model. R-squared = .20
(2) Model “(2)” drops [early,late]. R-squared is minimally lower: R-squared = .195
(3) Model “(3)” drops pregnum. R-squared drop is more substantial: R-squared = .159
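
The side-by-side comparison of the three models can be sketched with esttab from the user-written estout package, the same package that provides eststo. The model specifications are assumed from the descriptions in the summary, and the column titles are illustrative.

```stata
* Requires the estout package; install once with: ssc install estout
eststo clear

* (1) Maximal model
quietly regress p53 pregnum early late
eststo model1

* (2) Drops [early, late]
quietly regress p53 pregnum
eststo model2

* (3) Drops pregnum
quietly regress p53 early late
eststo model3

* Coefficients with standard errors and R-squared, one column per model
esttab model1 model2 model3, se r2 mtitles("(1) maximal" "(2) pregnum" "(3) early late")
```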


Uploaded by : Prisha Khanna
