One and Two (or More) Sample Hypothesis Testing Paper. Using data from one of the data sets available through the “Data Sets” link on your page, develop one business research question from which you will formulate a research hypothesis to test one population parameter and another to test two (or more) population parameters.
Formulate both a numerical and a verbal hypothesis statement regarding each of your research issues.
Perform hypothesis tests using the five-step model. Describe and interpret the results of the test, both in statistical terms and in conversational English. Include appropriate descriptive statistics.
Solution:
Research question: To find whether there is a significant difference between wins and salary of the baseball players.
There are two leagues, coded as
1 if American League and
0 if National League
We have separated the data set by league.
Data set
American League:
| Salary | Salary -mil | Wins |
|---|---|---|
| 123505125.0 | 123.5 | 95.0 |
| 208306817.0 | 208.3 | 95.0 |
| 55425762.0 | 55.4 | 88.0 |
| 73914333.0 | 73.9 | 74.0 |
| 97725322.0 | 97.7 | 95.0 |
| 41502500.0 | 41.5 | 93.0 |
| 75178000.0 | 75.2 | 99.0 |
| 45719500.0 | 45.7 | 80.0 |
| 56186000.0 | 56.2 | 83.0 |
| 29679067.0 | 29.7 | 67.0 |
| 55849000.0 | 55.8 | 79.0 |
| 69092000.0 | 69.1 | 71.0 |
| 87754334.0 | 87.8 | 69.0 |
| 36881000.0 | 36.9 | 56.0 |
Claim: There is a significant difference between wins and salary-mil of the baseball players in the American League.
Hypotheses:
Null Hypothesis:
Numerical Null Hypothesis: H0: μ1 = μ2, where μ1 is the mean salary-mil and μ2 is the mean wins
Verbal Null Hypothesis: There is no significant difference between wins and salary-mil of the baseball players in the American League.
Alternative Hypothesis:
Numerical Alternative Hypothesis: H1: μ1 ≠ μ2
Verbal Alternative Hypothesis: There is a significant difference between wins and salary-mil of the baseball players in the American League.
Level of Significance:
α = 0.05
Decision rule:
If the p-value is greater than the given level of significance, we fail to reject the null hypothesis. Otherwise, we reject the null hypothesis.
Test Statistic:
Using MegaStat in Microsoft Excel Add-Ins:
Add-Ins → MegaStat → Hypothesis tests → Compare two independent groups
Hypothesis Test: Independent Groups (t-test, pooled variance)

| Salary -mil | Wins | |
|---|---|---|
| 75.479 | 81.71 | mean |
| 45.930 | 13.07 | std. dev. |
| 14 | 14 | n |
| 26 | | df |
| -6.2357 | | difference (Salary -mil - Wins) |
| 1,140.1793 | | pooled variance |
| 33.7665 | | pooled std. dev. |
| 12.7626 | | standard error of difference |
| 0 | | hypothesized difference |
| -0.49 | | t |
| .6292 | | p-value (two-tailed) |
The test statistic value is -0.49.
The p-value for the test statistic is 0.6292.
Conclusion:
Since the p-value (0.6292) is greater than the 0.05 level of significance, we fail to reject the null hypothesis H0 at the 5% level of significance. Hence, we conclude that there is no significant difference between wins and salary-mil of the baseball players in the American League.
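As a cross-check on the MegaStat output, the pooled-variance t statistic can be recomputed directly from the American League columns. The following is a minimal Python sketch using only the standard library; the function name `pooled_t` is an illustrative choice and not part of MegaStat.

```python
import math
import statistics

# American League data from the table above (salary in millions, wins).
salary_mil = [123.5, 208.3, 55.4, 73.9, 97.7, 41.5, 75.2,
              45.7, 56.2, 29.7, 55.8, 69.1, 87.8, 36.9]
wins = [95, 95, 88, 74, 95, 93, 99, 80, 83, 67, 79, 71, 69, 56]


def pooled_t(x, y):
    """Return (t, df) for a two-sample t-test with pooled variance."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / df
    # Standard error of the difference between the two sample means.
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(x) - statistics.mean(y)) / std_err
    return t, df


t_stat, df = pooled_t(salary_mil, wins)
print(f"t = {t_stat:.2f}, df = {df}")
```

Run on these values, the sketch should agree with the reported t of about -0.49 on 26 degrees of freedom. The same function can be applied to the National League columns.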
Research question: To find whether there is a
significant difference between wins and salary of the baseball players.
Data set
National League:
| Salary | Salary -mil | Wins |
|---|---|---|
| 86457302.0 | 86.5 | 90.0 |
| 62329166.0 | 62.3 | 77.0 |
| 76799000.0 | 76.8 | 89.0 |
| 61892583.0 | 61.9 | 73.0 |
| 101305821.0 | 101.3 | 83.0 |
| 38133000.0 | 38.1 | 67.0 |
| 83039000.0 | 83.0 | 71.0 |
| 63290833.0 | 63.3 | 82.0 |
| 48581500.0 | 48.6 | 81.0 |
| 90199500.0 | 90.2 | 75.0 |
| 92106833.0 | 92.1 | 100.0 |
| 60408834.0 | 60.4 | 83.0 |
| 95522000.0 | 95.5 | 88.0 |
| 39934833.0 | 39.9 | 81.0 |
| 87032933.0 | 87.0 | 79.0 |
| 48155000.0 | 48.2 | 67.0 |
Claim: There is a significant difference between wins and salary-mil of the baseball players in the National League.
Hypotheses:
Null Hypothesis:
Numerical Null Hypothesis: H0: μ1 = μ2, where μ1 is the mean salary-mil and μ2 is the mean wins
Verbal Null Hypothesis: There is no significant difference between wins and salary-mil of the baseball players in the National League.
Alternative Hypothesis:
Numerical Alternative Hypothesis: H1: μ1 ≠ μ2
Verbal Alternative Hypothesis: There is a significant difference between wins and salary-mil of the baseball players in the National League.
Level of Significance:
α = 0.05
Decision rule:
If the p-value is greater than the given level of significance, we fail to reject the null hypothesis. Otherwise, we reject the null hypothesis.
Test Statistic:
Using MegaStat in Microsoft Excel Add-Ins:
Add-Ins → MegaStat → Hypothesis tests → Compare two independent groups
Hypothesis Test: Independent Groups (t-test, pooled variance)

| Salary -mil | Wins | |
|---|---|---|
| 70.949 | 80.375 | mean |
| 20.669 | 8.831 | std. dev. |
| 16 | 16 | n |
| 30 | | df |
| -9.4257 | | difference (Salary -mil - Wins) |
| 252.5883 | | pooled variance |
| 15.8930 | | pooled std. dev. |
| 5.6190 | | standard error of difference |
| 0 | | hypothesized difference |
| -1.68 | | t |
| .1038 | | p-value (two-tailed) |
The test statistic value is -1.68.
The p-value for the test statistic is 0.1038.
Conclusion:
Since the p-value (0.1038) is greater than the 0.05 level of significance, we fail to reject the null hypothesis H0 at the 5% level of significance. Hence, we conclude that there is no significant difference between wins and salary-mil of the baseball players in the National League.
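The two-tailed p-values used in the decision rule can be approximated without a statistics package. The sketch below is one possible approach, not MegaStat's own method: it integrates the Student's t density numerically with Simpson's rule, using the differences and standard errors reported in the two MegaStat tables. The function names `t_pdf` and `two_tailed_p` are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    # The density is symmetric, so the area between -|t| and +|t| is twice
    # the integral from 0 to |t|.
    central_area = 2 * (h / 3) * total
    return max(0.0, 1.0 - central_area)


# t = difference / standard error of difference, from the MegaStat tables.
p_american = two_tailed_p(-6.2357 / 12.7626, 26)
p_national = two_tailed_p(-9.4257 / 5.6190, 30)
print(f"American League p = {p_american:.4f}")
print(f"National League p = {p_national:.4f}")
```

Both values should land close to the reported p-values (.6292 and .1038), and both exceed α = 0.05, which is why neither null hypothesis is rejected.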
Regression analysis:
The general multiple regression model is given by
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where $y$ is the dependent variable, the $x_i$'s are the independent variables, $\beta_0$ is the actual constant, $\beta_i$ is the actual coefficient associated with the $i$th independent variable, and $\varepsilon$ is the error term, which models the unsystematic error of $y$.
The above model can be written in matrix form as
$$Y = X\beta + \varepsilon$$
The general goal of multiple regression is to determine which independent (explanatory) variables should be included in the model.
We first test each coefficient $\beta_i$, where $i = 1, 2, \ldots, k$, within the model, in order to determine whether that individual parameter should be dropped from the model.
Next we test the goodness of fit of the model.
Hypothesis Tests:
$$H_0: \beta_i = 0 \quad \text{versus} \quad H_1: \beta_i \neq 0$$
Procedure:
First we estimate the model as
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k$$
where $\hat{\beta}_i$ is the estimated value of $\beta_i$ and $i = 0, 1, \ldots, k$.
For Testing Each $\beta_i$:
The test statistic is given by
$$t = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$$
where $SE(\hat{\beta}_i)$ is the standard error of the estimated coefficient $\hat{\beta}_i$.
Goodness of fit test:
To assess the goodness of fit we generally compute R², which lies between 0 and 1. As R² approaches 1, the model explains more of the variation in the data, i.e. the model fits the data well.
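For reference, if the model is estimated by ordinary least squares (an assumption here, since the text does not name the estimation method), the coefficient estimates and their standard errors follow from the matrix form of the model. In this sketch, $n$ is the number of observations, $k$ the number of independent variables, and $s^2$ the residual mean square; these symbols are notation introduced for illustration.

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} Y, \qquad
s^2 = \frac{(Y - X\hat{\beta})^{\top} (Y - X\hat{\beta})}{n - k - 1}, \qquad
SE(\hat{\beta}_i) = \sqrt{s^2 \left[ (X^{\top} X)^{-1} \right]_{ii}}
```

Here $\left[ (X^{\top} X)^{-1} \right]_{ii}$ denotes the diagonal element corresponding to $\beta_i$, and each t statistic is compared with a t distribution on $n - k - 1$ degrees of freedom.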
Dependent variable:
X7- Wins
Independent variables:
X2- League
X3- Built
X4- Size
X5- Surface
X6- Salary- mil
X8- Attendance
X9- Batting
X10- ERA
X11- HR
X12- Error
X13- SB
Using MegaStat in Microsoft Excel Add-Ins:
Add-Ins → MegaStat → Correlation/Regression → Regression analysis
Regression Analysis

| | | | |
|---|---|---|---|
| R² | 0.857 | | |
| Adjusted R² | 0.770 | n | 30 |
| R | 0.926 | k | 11 |
| Std. Error | 5.200 | Dep. Var. | Wins |

ANOVA table

| Source | SS | df | MS | F | p-value |
|---|---|---|---|---|---|
| Regression | 2,917.2794 | 11 | 265.2072 | 9.81 | 1.64E-05 |
| Residual | 486.7206 | 18 | 27.0400 | | |
| Total | 3,404.0000 | 29 | | | |

Regression output (with 95% confidence interval)

| variables | coefficients | std. error | t (df=18) | p-value | 95% lower | 95% upper |
|---|---|---|---|---|---|---|
| Intercept | 74.6634 | 133.9145 | 0.558 | .5840 | -206.6805 | 356.0073 |
| League | -1.2494 | 2.3275 | -0.537 | .5980 | -6.1392 | 3.6404 |
| Built | -0.0274 | 0.0558 | -0.491 | .6291 | -0.1447 | 0.0899 |
| Size | -0.00000401 | 0.00020556 | -0.019 | .9847 | -0.00043588 | 0.00042787 |
| Surface | 0.5761 | 4.3135 | 0.134 | .8952 | -8.4863 | 9.6384 |
| Salary -mil | 0.0411 | 0.0667 | 0.615 | .5462 | -0.0992 | 0.1813 |
| Attendance | -0.00000085 | 0.00000317 | -0.267 | .7923 | -0.00000750 | 0.00000581 |
| Batting | 447.7443 | 200.5131 | 2.233 | .0385 | 26.4819 | 869.0067 |
| ERA | -13.6362 | 2.4171 | -5.642 | 2.37E-05 | -18.7143 | -8.5581 |
| HR | 0.0930 | 0.0338 | 2.755 | .0130 | 0.0221 | 0.1639 |
| Error | -0.1601 | 0.1246 | -1.285 | .2151 | -0.4218 | 0.1017 |
| SB | 0.0152 | 0.0361 | 0.422 | .6777 | -0.0605 | 0.0910 |
The regression equation is
Wins = 74.6634 - 1.2494 League - 0.0274 Built - 0.00000401 Size + 0.5761 Surface + 0.0411 Salary -mil - 0.00000085 Attendance + 447.7443 Batting - 13.6362 ERA + 0.0930 HR - 0.1601 Error + 0.0152 SB
The adjusted R² value (0.770) is high, so the model fits the data well. But the p-values for x2, x3, x4, x5, x6, x8, x12 and x13 are greater than 0.05, so these coefficients are not significant at the 5% level. A high R² combined with many insignificant coefficients suggests a possible multicollinearity problem. So we drop these variables and regress x7 on x9, x10 and x11.
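The fit statistics in the MegaStat summaries can be recomputed from the ANOVA sums of squares, which makes the comparison between the full and reduced models easy to check. The sketch below assumes the standard definitions of R², adjusted R² and the overall F statistic; the function name `fit_summary` is hypothetical.

```python
def fit_summary(ss_regression, ss_residual, n, k):
    """Return (R^2, adjusted R^2, F) from ANOVA sums of squares.

    n is the number of observations, k the number of independent variables.
    """
    ss_total = ss_regression + ss_residual
    r_squared = ss_regression / ss_total
    # Adjusted R^2 penalises extra predictors via the degrees of freedom.
    adj_r_squared = 1 - (ss_residual / (n - k - 1)) / (ss_total / (n - 1))
    f_stat = (ss_regression / k) / (ss_residual / (n - k - 1))
    return r_squared, adj_r_squared, f_stat


# Sums of squares from the two ANOVA tables in this paper.
full = fit_summary(2917.2794, 486.7206, n=30, k=11)
reduced = fit_summary(2757.1594, 646.8406, n=30, k=3)

for name, (r2, adj_r2, f) in (("full", full), ("reduced", reduced)):
    print(f"{name}: R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, F = {f:.2f}")
```

Because adjusted R² accounts for the number of predictors, it is the more suitable figure for comparing the 11-variable model with the 3-variable model.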
Regression Analysis: x7 versus x9, x10, x11
Dependent variable:
X7- Wins
Independent variables:
X9- Batting
X10- ERA
X11- HR
Using MegaStat in Microsoft Excel Add-Ins:
Add-Ins → MegaStat → Correlation/Regression → Regression analysis
Regression Analysis

| | | | |
|---|---|---|---|
| R² | 0.810 | | |
| Adjusted R² | 0.788 | n | 30 |
| R | 0.900 | k | 3 |
| Std. Error | 4.988 | Dep. Var. | Wins |

ANOVA table

| Source | SS | df | MS | F | p-value |
|---|---|---|---|---|---|
| Regression | 2,757.1594 | 3 | 919.0531 | 36.94 | 1.60E-09 |
| Residual | 646.8406 | 26 | 24.8785 | | |
| Total | 3,404.0000 | 29 | | | |

Regression output (with 95% confidence interval)

| variables | coefficients | std. error | t (df=26) | p-value | 95% lower | 95% upper |
|---|---|---|---|---|---|---|
| Intercept | 1.8499 | 35.0214 | 0.053 | .9583 | -70.1376 | 73.8374 |
| Batting | 492.4490 | 140.3025 | 3.510 | .0017 | 204.0532 | 780.8449 |
| ERA | -15.9575 | 1.6753 | -9.525 | 5.78E-10 | -19.4011 | -12.5139 |
| HR | 0.1035 | 0.0289 | 3.582 | .0014 | 0.0441 | 0.1628 |
The regression equation is
Wins = 1.8499 + 492.4490 Batting - 15.9575 ERA + 0.1035 HR
Here the p-values of the coefficients of Batting, ERA and HR are all less than 0.05, i.e. these coefficients are significant at the 5% level of significance. The R² value is only slightly reduced after dropping the variables (from 0.857 to 0.810), while the adjusted R² rises from 0.770 to 0.788, and hence the model is good.
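To illustrate how the reduced equation is used for prediction, the sketch below evaluates it at the sample means of Batting, ERA and HR taken from the descriptive statistics table in this paper. The function name `predict_wins` is illustrative.

```python
def predict_wins(batting, era, hr):
    """Predicted wins from the reduced regression equation."""
    return 1.8499 + 492.4490 * batting - 15.9575 * era + 0.1035 * hr


# Sample means of Batting, ERA and HR from the descriptive statistics table.
predicted = predict_wins(batting=0.26443, era=4.2847, hr=167.23)
print(f"Predicted wins at the sample means: {predicted:.1f}")
```

A least-squares fit with an intercept passes through the point of means, so the prediction at the sample means should be very close to the mean of Wins (81.000) in the descriptive statistics table, which gives a quick consistency check on the coefficients.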
Correlation:
Research question: To find whether salary has a relationship with attendance of the baseball players.
There are two leagues, coded as
1 if American League and
0 if National League
We have separated the data set by league.
Data set
American League:
| Salary -mil | Attendance |
|---|---|
| 123.5 | 2,847,798 |
| 208.3 | 4,090,440 |
| 55.4 | 2,108,818 |
| 73.9 | 2,623,904 |
| 97.7 | 3,404,636 |
| 41.5 | 2,014,220 |
| 75.2 | 2,342,804 |
| 45.7 | 2,014,995 |
| 56.2 | 2,034,243 |
| 29.7 | 1,141,915 |
| 55.8 | 2,525,259 |
| 69.1 | 2,024,505 |
| 87.8 | 2,724,859 |
| 36.9 | 1,371,181 |
Using MegaStat in Microsoft Excel Add-Ins:
Add-Ins → MegaStat → Correlation/Regression → Correlation Matrix

Correlation Matrix

| | Salary -mil | Attendance |
|---|---|---|
| Salary -mil | 1.000 | |
| Attendance | .895 | 1.000 |

14 sample size
The correlation coefficient between salary-mil and attendance is 0.895. There is a strong positive correlation between the variables.
Null Hypothesis:
H0: ρ = 0
H0: “no linear relationship” between the variables.
Alternative Hypothesis:
H1: ρ ≠ 0
H1: “linear relationship” between the variables.
Level of significance:
α = 0.05
Critical value:
At the 5% level of significance, the two-tailed critical value of the t distribution with v = 14 - 2 = 12 degrees of freedom is 2.178813.
Test statistic:
Under H0, the statistic
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
has a t distribution with v = n - 2 degrees of freedom.
r = 0.895 and n = 14
$$t = \frac{0.895\sqrt{14-2}}{\sqrt{1-0.895^2}} = \frac{0.895 \times 3.4641}{\sqrt{0.198975}} = \frac{3.1004}{0.4461} \approx 6.95$$
Conclusion:
Since the test statistic value (6.95) is greater than the critical value (2.178813), we reject the null hypothesis at the 5% level of significance. Hence we conclude that a linear relationship exists between the variables salary-mil and attendance.
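The correlation coefficient and its t statistic can also be reproduced from the American League table. This is a minimal standard-library sketch; the helper names `pearson_r` and `t_from_r` are illustrative, and the critical value is the one quoted in the text.

```python
import math

# American League data from the table above.
salary_mil = [123.5, 208.3, 55.4, 73.9, 97.7, 41.5, 75.2,
              45.7, 56.2, 29.7, 55.8, 69.1, 87.8, 36.9]
attendance = [2847798, 4090440, 2108818, 2623904, 3404636, 2014220, 2342804,
              2014995, 2034243, 1141915, 2525259, 2024505, 2724859, 1371181]


def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = pearson_r(salary_mil, attendance)
t_stat = t_from_r(r, len(salary_mil))
critical = 2.178813  # two-tailed 5% critical value for 12 df, as quoted in the text
print(f"r = {r:.3f}, t = {t_stat:.2f}, reject H0: {abs(t_stat) > critical}")
```

The same two helpers apply to the National League data by swapping in its columns and the critical value for 14 degrees of freedom.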
Data set
National League:
| Salary -mil | Attendance |
|---|---|
| 86.5 | 2,520,904 |
| 62.3 | 2,059,327 |
| 76.8 | 2,805,060 |
| 61.9 | 1,923,254 |
| 101.3 | 2,827,549 |
| 38.1 | 1,817,245 |
| 83 | 3,603,680 |
| 63.3 | 2,869,787 |
| 48.6 | 2,730,352 |
| 90.2 | 3,181,020 |
| 92.1 | 3,542,271 |
| 60.4 | 1,852,608 |
| 95.5 | 2,665,304 |
| 39.9 | 2,211,323 |
| 87 | 3,100,092 |
| 48.2 | 1,914,385 |
Using MegaStat in Microsoft Excel Add-Ins:
Add-Ins → MegaStat → Correlation/Regression → Correlation Matrix

Correlation Matrix

| | Salary -mil | Attendance |
|---|---|---|
| Salary -mil | 1.000 | |
| Attendance | .693 | 1.000 |

16 sample size
The correlation coefficient between salary-mil and attendance is 0.693. There is a moderately strong positive correlation between the variables.
Null Hypothesis:
H0: ρ = 0
H0: “no linear relationship” between the variables.
Alternative Hypothesis:
H1: ρ ≠ 0
H1: “linear relationship” between the variables.
Level of significance:
α = 0.05
Critical value:
At the 5% level of significance, the two-tailed critical value of the t distribution with v = 16 - 2 = 14 degrees of freedom is 2.144787.
Test statistic:
Under H0, the statistic
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
has a t distribution with v = n - 2 degrees of freedom.
r = 0.693 and n = 16
$$t = \frac{0.693\sqrt{16-2}}{\sqrt{1-0.693^2}} = \frac{0.693 \times 3.7417}{\sqrt{0.519751}} = \frac{2.5930}{0.7209} \approx 3.60$$
Conclusion:
Since the test statistic value (3.60) is greater than the critical value (2.144787), we reject the null hypothesis at the 5% level of significance. Hence we conclude that a linear relationship exists between the variables salary-mil and attendance.
Descriptive Statistics:
Using MegaStat in Microsoft Excel Add-Ins:
Add-Ins → MegaStat → Descriptive Statistics
| | Salary -mil | Wins | Attendance | Batting | ERA | HR | Error | SB |
|---|---|---|---|---|---|---|---|---|
| count | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 |
| mean | 73.064 | 81.000 | 2,496,457.93 | 0.26443 | 4.2847 | 167.23 | 102.00 | 85.50 |
| sample variance | 1,171.965 | 117.379 | 452,766,738,769.44 | 0.00005 | 0.3206 | 1,225.29 | 130.34 | 1,075.43 |
| sample standard deviation | 34.234 | 10.834 | 672,879.44 | 0.00728 | 0.5662 | 35.00 | 11.42 | 32.79 |
| minimum | 29.679067 | 56 | 1141915 | 0.252 | 3.49 | 117 | 86 | 31 |
| maximum | 208.30682 | 100 | 4090440 | 0.281 | 5.49 | 260 | 125 | 161 |
| range | 178.62775 | 44 | 2948525 | 0.029 | 2 | 143 | 39 | 130 |
| 1st quartile | 50.293 | 73.250 | 2,017,372.50 | 0.25900 | 3.7875 | 136.75 | 92.50 | 65.25 |
| median | 66.191 | 81.000 | 2,523,081.50 | 0.26400 | 4.2000 | 164.00 | 102.50 | 76.00 |
| 3rd quartile | 87.574 | 88.750 | 2,842,735.75 | 0.27000 | 4.5500 | 190.50 | 108.75 | 101.25 |
| interquartile range | 37.281 | 15.500 | 825,363.25 | 0.01100 | 0.7625 | 53.75 | 16.25 | 36.00 |
| mode | #N/A | 95.000 | #N/A | 0.27000 | 3.6100 | 130.00 | 106.00 | 45.00 |
The descriptive statistics for the whole data set (both leagues combined, n = 30) are given in the above table.
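As an illustration of how these summary measures are obtained, the Wins column can be recomputed from the two league tables with Python's standard `statistics` module. The `method="inclusive"` quartile option is an assumption about which interpolation convention corresponds to the spreadsheet output; it is not stated in the source.

```python
import statistics

# Wins for all 30 teams (American League rows first, then National League).
wins = [95, 95, 88, 74, 95, 93, 99, 80, 83, 67, 79, 71, 69, 56,
        90, 77, 89, 73, 83, 67, 71, 82, 81, 75, 100, 83, 88, 81, 79, 67]

# Quartiles by linear interpolation between data points (assumed convention).
q1, q2, q3 = statistics.quantiles(wins, n=4, method="inclusive")

summary = {
    "count": len(wins),
    "mean": statistics.mean(wins),
    "sample variance": statistics.variance(wins),
    "sample standard deviation": statistics.stdev(wins),
    "minimum": min(wins),
    "maximum": max(wins),
    "range": max(wins) - min(wins),
    "1st quartile": q1,
    "median": statistics.median(wins),
    "3rd quartile": q3,
    "interquartile range": q3 - q1,
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The results can be compared line by line with the Wins column of the table. The mode is left out of the sketch because three values (95, 83 and 67) each occur three times in the Wins data, so a single reported mode depends on the tool's tie-breaking rule.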
Inference for our research:
- From the analysis comparing two independent groups, we find no significant difference between wins and salary-mil of the baseball players in the American League.
- From the analysis comparing two independent groups, we find no significant difference between wins and salary-mil of the baseball players in the National League.
- From the regression analysis, the regression equation for predicting wins is
Wins = 1.8499 + 492.4490 Batting - 15.9575 ERA + 0.1035 HR
- From the correlation analysis, we find that a linear relationship exists between the variables salary-mil and attendance of the baseball players in the American League.
- From the correlation analysis, we find that a linear relationship exists between the variables salary-mil and attendance of the baseball players in the National League.
