BIOSTATS 690C – Fall 2020 9. SUPPLEMENT: Stata for
Normal Theory Regression - version 16 Page 1
of 26
Unit 9
Setting:
Calls to the New York Auto Club are possibly related to the weather,
with more calls occurring during bad weather. This example illustrates
descriptive analyses and simple linear regression to explore this
hypothesis in a data set containing information on calendar day,
weather, and numbers of calls.
Stata Data Set:
ers.dta
In this illustration, the data set ers.dta is
accessed from the BIOSTATS 690C course website directly. It is then
saved to your current working directory.
. ***** Save the inputted data to the directory you have chosen above
. ***** Command is save "NAME", replace
. save "ers.dta", replace
(note: file ers.dta not found)
file ers.dta saved
. * Describe data set
. codebook, compact
Variable Obs Unique Mean Min Max Label
   stats |       low     calls
---------+--------------------
       N |        28        28
    mean |     21.75   4318.75
      sd |  13.27383  2692.564
     min |        -2      1674
     max |        41      8947
------------------------------
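The summary table above has the layout that Stata's tabstat command produces. A minimal sketch of a command that would give a table of this shape, assuming ers.dta is in memory with the variables low and calls (the exact command used in the original log is not shown here, so treat this as an assumption):

```stata
* Assumes ers.dta is already in memory (variables: low, calls)
* Request N, mean, SD, minimum and maximum for each variable
tabstat low calls, statistics(n mean sd min max)
```

With several statistics requested, tabstat puts the statistics in rows and the variables in columns by default, which matches the layout shown.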
.
    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
       calls |         28    0.82916      5.159     3.378    0.00037
The null hypothesis of normality of Y=calls is rejected (p-value
= .00037). Tip: sometimes the cure is worse than the original violation.
For now, we'll charge on.
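The columns W, V, z and Prob>z in the table above are those reported by Stata's Shapiro-Wilk test. A minimal sketch of how such a test could be run, assuming ers.dta is in memory; the normal quantile plot is an added suggestion, not part of the original log:

```stata
* Assumes ers.dta is already in memory
* Shapiro-Wilk test of normality for Y = calls
swilk calls
* Normal quantile plot as a visual companion to the test
qnorm calls
```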
Remarks
From this output, the analysis of variance is the following:
4. Model Examination
.
. * Scatterplot with overlay fit and overlay 95% confidence band
. * Tip! Because of layering, draw the confidence interval first, then the fit, then the data points
. ***** graph twoway (lfitci YVARIABLE XVARIABLE) (lfit YVARIABLE XVARIABLE) (scatter YVARIABLE XVARIABLE, symbol(d)), title("TITLE") subtitle("TITLE")
. graph twoway (lfitci calls low) (lfit calls low) (scatter calls low, symbol(d)), title("Calls to NY Auto Club 1993-1994") subtitle("95% Confidence Bands")
Not bad actually!
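The fitted line and confidence band drawn by lfit and lfitci can also be computed as variables, which helps when you want to inspect the numbers behind the plot. A minimal sketch, assuming ers.dta is in memory; the new variable names yhat, se_mean, ci_lo and ci_hi are hypothetical:

```stata
* Assumes ers.dta is already in memory (variables: calls, low)
regress calls low
* Fitted values
predict yhat, xb
* Standard error of the fitted mean
predict se_mean, stdp
* 95% confidence limits for the mean of calls at each value of low
generate ci_lo = yhat - invttail(e(df_r), 0.025)*se_mean
generate ci_hi = yhat + invttail(e(df_r), 0.025)*se_mean
list low calls yhat ci_lo ci_hi in 1/5
```

By default, lfitci bases its band on the standard error of the fitted mean, so these limits should correspond to the shaded band in the plot.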
• ... values > 1 as significant.
II – Multiple Linear Regression
. ***** Just to be safe! save the ers.dta data again
. save "ers.dta", replace
file ers.dta saved
. ***** Clear the workspace
. ***** Command is clear
. clear
Data are complete; n=67 for every variable. Y=p53 has a
limited range, so the assumption of normality is a bit dicey, but
we'll proceed anyway. Current age (agecurr) ranges from 15 to 75.
. *
. ***** Pairwise correlations for all the variables
. pwcorr p53 pregnum agefirst agecurr menop, star(0.05) sig
The relationship between Y=p53 and age at 1st pregnancy (agefirst) does
not look linear, so we will create dummy (0/1 indicator) variables for
age at 1st pregnancy.
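One way the indicator variables early and late could be created is sketched below. The numeric coding of agefirst is an assumption (0 = never pregnant, 1 = age le 24, 2 = age > 24), so check the actual codes first and adjust the comparisons if they differ:

```stata
* Show the numeric codes behind the value labels of agefirst
tabulate agefirst, nolabel
* ASSUMED coding: 0 = never pregnant, 1 = age le 24, 2 = age > 24
generate early = (agefirst == 1) if !missing(agefirst)
generate late  = (agefirst == 2) if !missing(agefirst)
* Cross-tabulate to confirm each indicator is well defined
tab2 agefirst early
tab2 agefirst late
```

The "if !missing(agefirst)" qualifier keeps a missing agefirst from being silently coded as 0. Never-pregnant women serve as the reference group, with early = 0 and late = 0.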
. tab2 agefirst early

-> tabulation of agefirst by early

    Age at 1st |         early
     Pregnancy |         0          1 |     Total
---------------+----------------------+----------
never pregnant |        16          0 |        16
     age le 24 |         0         32 |        32
      age > 24 |        19          0 |        19
---------------+----------------------+----------
         Total |        35         32 |        67
Ditto. The new variable, late, is well defined.
. label variable early "Age le 24"
. label variable late "Age gt 24"
. *------------------------------------------------------------------------------
. * Model Estimation Set I: Determination of best model in the predictors of interest.
. * Goal is to obtain best parameterization before considering covariates.
. *------------------------------------------------------------------------------
NOTE!! We see a consequence of the multicollinearity of our
predictors, [early, late] and pregnum:
[early, late] have NON-significant t-statistic p-values.
pregnum has a t-statistic p-value that is only marginally significant.
. *
. ***** 2 df Partial F-test (Null: [early, late] are not significant, controlling for pregnum).
Marginally statistically significant (p-value = .0656). The
null hypothesis is rejected at the 0.10 level but not at the 0.05 level.
Conclude that, in the model that contains [early, late], pregnum is
marginally statistically significantly associated with Y=p53.
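Partial F-tests of this kind can be obtained with Stata's test command after fitting the maximal model. A minimal sketch, assuming p53, pregnum, early and late are in memory as in the log; the original commands are not shown here, so this is one reasonable way to do it rather than a record of what was run:

```stata
* Maximal model: pregnum plus the design variables [early, late]
regress p53 pregnum early late
* 2 df partial F-test of [early, late], controlling for pregnum
test early late
* 1 df partial F-test of pregnum, controlling for [early, late]
test pregnum
```

For a single coefficient, the 1 df partial F-statistic equals the square of that coefficient's t-statistic, so its p-value matches the one in the regression table.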
. *
. ***** Save results from model above to "model1" for tabulation later.
------------------------------------------------------------------------------
         p53 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     pregnum |   .4152523   .1045572     3.97   0.000     .2064372    .6240675
       _cons |   2.563537   .2087239    12.28   0.000     2.146687    2.980388
------------------------------------------------------------------------------
The fitted line is p53 = 2.56 + (0.42)*pregnum.
19.5% of the variability in Y=p53 is explained by this model
(R-squared = .1953).
This model is statistically significantly more explanatory than
the null model (p-value = .0002).
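The numbers in these remarks can be checked by hand from the coefficient table using display. A minimal sketch, assuming n=67 (so 65 residual degrees of freedom in this one-predictor model); the value pregnum = 2 is purely illustrative:

```stata
* t-statistic for pregnum = coefficient / standard error (about 3.97)
display .4152523/.1045572
* Two-sided p-value on 67 - 2 = 65 residual df (assumes n = 67)
display 2*ttail(65, .4152523/.1045572)
* Fitted p53 for a hypothetical woman with pregnum = 2
display 2.563537 + .4152523*2
```

In a simple linear regression the overall F-statistic is the square of the slope's t-statistic, so the two-sided p-value computed here should agree with the overall model p-value up to rounding.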
. *
. ***** Regression of Y=p53 on design variables [early, late] only. pregnum dropped.
. eststo model3
. *
. ***** SUMMARY of Model Estimation Set I.
Choose model "(2)" as a good "minimally adequate" model:
Y=p53 and X=pregnum. Here is why.
(1) Model "(1)" is the maximal model. R-squared = .20
(2) Model "(2)" drops [early, late]. R-squared is minimally lower: R-squared = .195
(3) Model "(3)" drops pregnum. The drop in R-squared is more substantial: R-squared = .159
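The three stored models can be placed side by side to make this comparison easier to read. A minimal sketch, assuming the user-written estout package (which provides eststo and esttab) is installed and that p53, pregnum, early and late are in memory; the choice of the se and r2 options is an assumption about what is useful to display:

```stata
* Requires the user-written estout package: ssc install estout
eststo clear
* Model (1): maximal model
quietly regress p53 pregnum early late
eststo model1
* Model (2): drops [early, late]
quietly regress p53 pregnum
eststo model2
* Model (3): drops pregnum
quietly regress p53 early late
eststo model3
* Side-by-side table with standard errors and R-squared
esttab model1 model2 model3, se r2
```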