8. CORRELATION AND REGRESSION
8.1 BIVARIATE DISTRIBUTION:
In a bivariate distribution we may be interested in finding out whether there is any correlation or covariation between the two variables under study. If a change in one variable is accompanied by a change in the other, the variables are said to be correlated. If the two variables deviate in the same direction, that is, if an increase in one results in a corresponding increase in the other, the correlation is said to be direct or positive. But if they constantly deviate in opposite directions, that is, if an increase in one results in a corresponding decrease in the other, the correlation is said to be inverse or negative.
8.2 SCATTER DIAGRAM:
It is the simplest diagrammatic representation of bivariate data. For the bivariate distribution (xi, yi); i = 1, 2, ..., n, if the values of the variables X and Y are plotted along the x-axis and y-axis respectively in the xy-plane, the diagram of dots so obtained is known as a scatter diagram. From the scatter diagram we can form a fairly good, though vague, idea of whether the variables are correlated: if the points are very dense, i.e., very close to each other, we should expect a fairly good amount of correlation between the variables, and if the points are widely scattered, a poor correlation is expected. This method, however, is not suitable if the number of observations is fairly large.
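8.3 KARL PEARSON COEFFICIENT OF CORRELATION:
As a measure of the intensity or degree of linear relationship between two variables, Karl Pearson, a British biometrician, developed a formula called the correlation coefficient.
The correlation coefficient between two random variables X and Y, usually denoted by r(X, Y) or simply r_XY, is a numerical measure of the linear relationship between them and is defined as
$$r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
where, for the bivariate distribution (xi, yi); i = 1, 2, ..., n,
$$\operatorname{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \qquad \sigma_X^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \sigma_Y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$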
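The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the sample values are made up for the example, and the divisor n (not n - 1) is used for the moments, as in the definition.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r = Cov(X, Y) / (sigma_X * sigma_Y), with divisor n."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs_demo = [1, 2, 3, 4, 5]
ys_demo = [2, 4, 5, 4, 5]
r_demo = pearson_r(xs_demo, ys_demo)  # equals 6 / sqrt(60), about 0.775
```

A perfectly linear increasing relation such as y = 2x gives r = 1, and a perfectly linear decreasing relation gives r = -1.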
8.4 CALCULATION OF THE CORRELATION COEFFICIENT FOR A BIVARIATE FREQUENCY DISTRIBUTION:
When the data are considerably large, they may be summarized by using a two-way table. For each variable a suitable number of classes is taken, keeping in view the same considerations as in the univariate case. If there are n classes for X and m classes for Y, there will be m × n cells in all in the two-way table. By going through the pairs of values of X and Y, we can find the frequency for each cell. The whole set of cell frequencies then defines a bivariate frequency distribution. The column totals and row totals give the marginal distributions of X and Y. A particular column or row is called the conditional distribution of Y for given X or of X for given Y respectively.
Suppose that the bivariate data on X and Y are presented in a two-way correlation table where there are m classes of Y placed along the horizontal lines (rows) and n classes of X along the vertical lines (columns), and fij is the frequency of individuals lying in the (i, j)th cell, i.e., in the ith class of X and the jth class of Y.
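Here
$$f_{\cdot j} = \sum_{i=1}^{n} f_{ij}$$
is the sum of the frequencies along any row and
$$f_{i\cdot} = \sum_{j=1}^{m} f_{ij}$$
is the sum of the frequencies along any column. We observe that
$$\sum_{i=1}^{n} f_{i\cdot} = \sum_{j=1}^{m} f_{\cdot j} = \sum_{i=1}^{n}\sum_{j=1}^{m} f_{ij} = N.$$
Then, with xi and yj denoting the mid-values of the classes of X and Y,
$$\bar{x} = \frac{1}{N}\sum_{i} x_i f_{i\cdot}, \qquad \bar{y} = \frac{1}{N}\sum_{j} y_j f_{\cdot j},$$
$$\sigma_X^2 = \frac{1}{N}\sum_{i} x_i^2 f_{i\cdot} - \bar{x}^2, \qquad \sigma_Y^2 = \frac{1}{N}\sum_{j} y_j^2 f_{\cdot j} - \bar{y}^2,$$
$$\operatorname{Cov}(X, Y) = \frac{1}{N}\sum_{i}\sum_{j} x_i y_j f_{ij} - \bar{x}\,\bar{y}, \qquad r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$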
8.5 PROBABLE ERROR OF CORRELATION COEFFICIENT:
If r is the correlation coefficient in a sample of n pairs of observations, then its standard error is given by
$$\text{S.E.}(r) = \frac{1 - r^2}{\sqrt{n}}$$
The probable error of the correlation coefficient is given by
$$\text{P.E.}(r) = 0.6745 \times \frac{1 - r^2}{\sqrt{n}}$$
Probable error is an old measure for testing the reliability of an observed correlation coefficient. The reason for taking the factor 0.6745 is that in a normal distribution the range µ ± 0.6745σ covers 50% of the total area. According to Secrist, “the probable error of the correlation coefficient is an amount which if added to and subtracted from the mean correlation coefficient, produces amounts within which the chances are even that a coefficient of correlation from a series selected at random will fall.”
If r < P.E.(r), the correlation is not at all significant. If r > 6 P.E.(r), it is definitely significant. A rigorous method of testing the significance of an observed correlation coefficient will be discussed later under “tests of significance” in sampling.
Probable error also enables us to find the limits within which the population correlation coefficient can be expected to vary. The limits are r ± P.E.(r).
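The rule of thumb above can be sketched as follows. The function names are hypothetical, the formulas used are S.E.(r) = (1 - r²)/√n and P.E.(r) = 0.6745 × S.E.(r), and comparing |r| rather than r with the probable error is an assumption made here so that negative correlations are handled the same way.

```python
import math

def standard_error_r(r, n):
    """S.E.(r) = (1 - r^2) / sqrt(n) for a sample of n pairs."""
    if n <= 0 or not -1 <= r <= 1:
        raise ValueError("need n > 0 and -1 <= r <= 1")
    return (1 - r * r) / math.sqrt(n)

def probable_error_r(r, n):
    """P.E.(r) = 0.6745 * S.E.(r)."""
    return 0.6745 * standard_error_r(r, n)

def probable_error_verdict(r, n):
    """Rule of thumb applied to |r| (the absolute value is an assumption)."""
    pe = probable_error_r(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"

def probable_limits(r, n):
    """Limits r - P.E.(r) and r + P.E.(r) for the population correlation."""
    pe = probable_error_r(r, n)
    return (r - pe, r + pe)

# Hypothetical sample: r = 0.8 observed from n = 64 pairs.
verdict_demo = probable_error_verdict(0.8, 64)
limits_demo = probable_limits(0.8, 64)
```

For r = 0.8 and n = 64 the standard error is 0.36/8 = 0.045, so the probable error is about 0.030 and r is far more than six times larger.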
8.6 RANK CORRELATION:
Let us suppose that a group of n individuals is arranged in order of merit or proficiency in the possession of two characteristics A and B. These ranks in the two characteristics will, in general, be different. For example, if we consider the relation between intelligence and beauty, it is not necessary that a beautiful individual is intelligent also. Let (xi, yi); i = 1, 2, ..., n be the ranks of the ith individual in the two characteristics A and B respectively. The Pearsonian coefficient of correlation between the ranks xi and yi is called the rank correlation coefficient between A and B for that group of individuals.
Assuming that no two individuals are bracketed equal in either classification, each of the variables X and Y takes the values 1, 2, ..., n.
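Hence
$$\bar{x} = \bar{y} = \frac{1}{n}(1 + 2 + \dots + n) = \frac{n+1}{2}, \qquad \sigma_X^2 = \sigma_Y^2 = \frac{1}{n}\sum_{i=1}^{n} i^2 - \left(\frac{n+1}{2}\right)^2 = \frac{n^2 - 1}{12}.$$
In general xi ≠ yi. Let di = xi - yi, so that
$$d_i = (x_i - \bar{x}) - (y_i - \bar{y}).$$
Squaring and summing over i from 1 to n, we get
$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + \sum_{i=1}^{n}(y_i - \bar{y})^2 - 2\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$
Dividing both sides by n, we get
$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = \sigma_X^2 + \sigma_Y^2 - 2\operatorname{Cov}(X, Y) = \sigma_X^2 + \sigma_Y^2 - 2\rho\,\sigma_X\sigma_Y,$$
where ρ is the rank correlation coefficient between A and B. Since σX = σY,
$$\frac{1}{n}\sum_{i=1}^{n} d_i^2 = 2\sigma_X^2(1 - \rho) \quad\Rightarrow\quad \rho = 1 - \frac{\sum d_i^2}{2n\sigma_X^2} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$
which is Spearman’s formula for the rank correlation coefficient.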
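Spearman's formula, ρ = 1 - 6Σd²/(n(n² - 1)), is easy to try out numerically. The sketch below assumes, as in the derivation, that there are no tied ranks and that each list is a permutation of 1, 2, ..., n; the function name and the ranks are illustrative only.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for untied ranks 1..n."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rank lists of equal length with n >= 2")
    expected = list(range(1, n + 1))
    if sorted(ranks_a) != expected or sorted(ranks_b) != expected:
        raise ValueError("each list must be a permutation of 1..n (no ties)")
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))

# Hypothetical ranks of five individuals in characteristics A and B.
rho_demo = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Here Σd² = 4, so ρ = 1 - 24/120 = 0.8. Identical rankings give ρ = 1 and exactly reversed rankings give ρ = -1.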
8.7 REGRESSION:
The term “regression” literally means “stepping back towards the average”. It was first used by the British biometrician Sir Francis Galton in connection with the inheritance of stature. Galton found that the offspring of abnormally tall or short parents tend to “regress” or “step back” to the average population height. But the term “regression” as now used in statistics is only a convenient term without any reference to biometry.
Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data.
In regression analysis there are two types of variables. The variable whose value is influenced or is to be predicted is called the dependent variable, and the variable which influences the values or is used for prediction is called the independent variable. In regression analysis the independent variable is also known as the regressor, predictor or explanatory variable, while the dependent variable is also known as the regressed or explained variable.
8.7.1 Lines of regression:
If the variables
in a bivariate distribution are related, we will find the points in the scatter
diagram will cluster round some curve called the “curve of regression”. If the
curve is a straight line, it is called the line of regression and there is said
to be linear regression between the variables, otherwise regression is said to
be curvilinear.
The line of
regression is the line which gives the best estimate to the value of one
variable for any specific value of the other variable. Thus the line of
regression is the line of “best fit” and is obtained by the principle of least
squares.
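Let us suppose that in the bivariate distribution (xi, yi); i = 1, 2, ..., n, Y is the dependent variable and X is the independent variable. Let the line of regression of Y on X be Y = a + bX.
According to the principle of least squares, the normal equations for estimating a and b are
$$\sum_{i=1}^{n} y_i = na + b\sum_{i=1}^{n} x_i \qquad (1)$$
and
$$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 \qquad (2)$$
From (1), on dividing by n, we get
$$\bar{y} = a + b\bar{x} \qquad (a)$$
Thus the line of regression of Y on X passes through the point $(\bar{x}, \bar{y})$.
Now
$$\mu_{11} = \operatorname{Cov}(X, Y) = \frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y} \quad\Rightarrow\quad \frac{1}{n}\sum x_i y_i = \mu_{11} + \bar{x}\,\bar{y} \qquad (3)$$
Also
$$\sigma_X^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 \quad\Rightarrow\quad \frac{1}{n}\sum x_i^2 = \sigma_X^2 + \bar{x}^2 \qquad (4)$$
Dividing (2) by n and using (3) and (4), we get
$$\mu_{11} + \bar{x}\,\bar{y} = a\bar{x} + b(\sigma_X^2 + \bar{x}^2) \qquad (5)$$
Multiplying (a) by $\bar{x}$ and then subtracting from (5), we get
$$\mu_{11} = b\,\sigma_X^2 \quad\Rightarrow\quad b = \frac{\mu_{11}}{\sigma_X^2}.$$
Since ‘b’ is the slope of the line of regression of Y on X and since the line of regression passes through the point $(\bar{x}, \bar{y})$, its equation is
$$Y - \bar{y} = \frac{\mu_{11}}{\sigma_X^2}(X - \bar{x}) \qquad (6)$$
$$Y - \bar{y} = r\,\frac{\sigma_Y}{\sigma_X}(X - \bar{x}) \qquad (7)$$
Starting with the equation X = A + BY and proceeding similarly, or by simply interchanging the variables X and Y in (6) and (7), the equation of the line of regression of X on Y becomes
$$X - \bar{x} = \frac{\mu_{11}}{\sigma_Y^2}(Y - \bar{y})$$
$$X - \bar{x} = r\,\frac{\sigma_X}{\sigma_Y}(Y - \bar{y})$$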
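As a rough numerical sketch of the two lines of regression, the code below computes the means, μ11 and the variances with divisor n, and then the slopes μ11/σX² (Y on X) and μ11/σY² (X on Y). The function names and the data are hypothetical.

```python
def regression_lines(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) for the two least-squares lines."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return mean_x, mean_y, mu11 / var_x, mu11 / var_y

def estimate_y(x, mean_x, mean_y, b_yx):
    """Line of Y on X: Y - mean_y = b_yx * (X - mean_x)."""
    return mean_y + b_yx * (x - mean_x)

def estimate_x(y, mean_x, mean_y, b_xy):
    """Line of X on Y: X - mean_x = b_xy * (Y - mean_y)."""
    return mean_x + b_xy * (y - mean_y)

# Hypothetical observations.
xs_demo = [1, 2, 3, 4, 5]
ys_demo = [2, 4, 5, 4, 5]
mean_x, mean_y, b_yx, b_xy = regression_lines(xs_demo, ys_demo)
y_at_5 = estimate_y(5, mean_x, mean_y, b_yx)
```

For these values the means are (3, 4), μ11 = 1.2, σX² = 2 and σY² = 1.2, so the slope of Y on X is 0.6 and the slope of X on Y is 1.0. Both lines pass through the point of means.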
8.7.2 Regression Curves:
In modern terminology, the conditional mean E(Y|X=x) for a continuous distribution is called the regression function of Y on X, and the graph of this function of x is known as the regression curve of Y on X, or sometimes the regression curve for the mean of Y. Geometrically, the regression function represents the y coordinate of the centre of mass of the bivariate probability mass in the infinitesimal vertical strip bounded by x and x+dx.
Similarly, the regression function of X on Y is E(X|Y=y), and the graph of this function of y is called the regression curve of X on Y.
In case a regression curve is a straight line, the corresponding regression is said to be linear. If one of the regressions is linear, it does not, however, follow that the other is also linear.
8.7.3 Regression coefficients:
‘b’, the slope of the line of regression of Y on X, is also called the coefficient of regression of Y on X. It represents the increment in the value of the dependent variable Y corresponding to a unit change in the value of the independent variable X. More precisely, we write
$$b_{YX} = \text{Regression coefficient of Y on X} = \frac{\mu_{11}}{\sigma_X^2} = r\,\frac{\sigma_Y}{\sigma_X}$$
Similarly, the coefficient of regression of X on Y indicates the change in the value of the variable X corresponding to a unit change in the value of the variable Y and is given by
$$b_{XY} = \text{Regression coefficient of X on Y} = \frac{\mu_{11}}{\sigma_Y^2} = r\,\frac{\sigma_X}{\sigma_Y}$$
8.7.4 Properties of Regression Coefficients:
(a) The correlation coefficient is the geometric mean of the regression coefficients, i.e., r² = bXY · bYX, the sign of r being the same as that of the regression coefficients.
(b) If one of the regression coefficients is greater than unity, the other must be less than unity.
(c) The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient r, provided r > 0.
(d) Regression coefficients are independent of the change of origin but not of scale.
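The four properties can be checked on a small data set. The sketch below is only a numerical illustration on made-up values, not a proof; the helper name `regression_coefficients` is hypothetical, and the moments use divisor n.

```python
import math

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) computed from divisor-n moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return mu11 / var_x, mu11 / var_y, mu11 / math.sqrt(var_x * var_y)

# Hypothetical observations.
xs_demo = [2, 4, 6, 8]
ys_demo = [1, 3, 7, 9]
b_yx, b_xy, r = regression_coefficients(xs_demo, ys_demo)

# (a) r^2 equals the product of the two regression coefficients.
product_matches = math.isclose(r * r, b_yx * b_xy)
# (b) here b_yx exceeds 1, so b_xy must be below 1.
one_above_one_below = b_yx > 1 and b_xy < 1
# (c) the arithmetic mean is not smaller than r (r > 0 here).
mean_not_below_r = (b_yx + b_xy) / 2 >= r
# (d) change of origin: shift X by +10 and Y by -3.
shifted = regression_coefficients([x + 10 for x in xs_demo], [y - 3 for y in ys_demo])
# (d) change of scale: U = X/2, V = Y/4 multiplies b_yx by (1/4)/(1/2).
scaled = regression_coefficients([x / 2 for x in xs_demo], [y / 4 for y in ys_demo])
```

With these numbers bYX = 1.4 and bXY = 0.7, so r² = 0.98. Shifting the origin leaves both coefficients unchanged, while the change of scale halves bYX and doubles bXY.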
8.8 CORRELATION RATIO:
As discussed earlier, when variables are linearly related, we have the regression lines of one variable on the other, and the correlation coefficient can be computed to tell us about the extent of association between them. However, if the variables are not linearly related but some sort of curvilinear relationship exists between them, the use of r, which is a measure of the degree to which the relation approaches a straight line “law”, will be misleading. We might come across bivariate distributions where r may be very low or even zero but the regression may be strong, or even perfect. The correlation ratio ‘η’ is the appropriate measure of curvilinear relationship between the two variables. Just as r measures the concentration of points about the straight line of best fit, η measures the concentration of points about the curve of best fit. If the regression is linear, η = |r|; otherwise η > |r|.
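A small sketch can make the contrast between r and η concrete. It assumes the usual definition of the correlation ratio of Y on X for grouped data, η² = Σ nᵢ(ȳᵢ - ȳ)² / Σ(y - ȳ)², where ȳᵢ is the mean of the Y values in the ith array (the values sharing one X); this formula is not stated in the text above, and the function names and data are illustrative.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Karl Pearson's r with divisor-n moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

def correlation_ratio(xs, ys):
    """Correlation ratio of Y on X: between-array variation / total variation."""
    arrays = defaultdict(list)
    for x, y in zip(xs, ys):
        arrays[x].append(y)
    mean_y = sum(ys) / len(ys)
    between = sum(len(a) * (sum(a) / len(a) - mean_y) ** 2 for a in arrays.values())
    total = sum((y - mean_y) ** 2 for y in ys)
    return math.sqrt(between / total)

# Hypothetical data with a perfect curvilinear relation Y = X^2.
xs_demo = [-2, -1, 0, 1, 2]
ys_demo = [x * x for x in xs_demo]
r_demo = pearson_r(xs_demo, ys_demo)
eta_demo = correlation_ratio(xs_demo, ys_demo)
```

For this parabola r is zero although Y is completely determined by X, while η equals 1.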
8.9 INTRA-CLASS CORRELATION:
Intra-class correlation means within-class correlation. It is distinguishable from product moment correlation inasmuch as here both the variables measure the same characteristic. Sometimes, especially in biological and agricultural studies, it is of interest to know how the members of a family or group are correlated among themselves with respect to some one of their common characteristics. For example, we may require the correlation between the heights of brothers of a family or between the yields of plots of an experimental block. In such cases both the variables measure the same characteristic, e.g., height and height or weight and weight. There is nothing to distinguish one from the other, so either may be treated as the X-variable and the other as the Y-variable.
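Suppose we have A1, A2, ..., An families with k1, k2, ..., kn members, each of which may be represented as
$$x_{i1}, x_{i2}, \dots, x_{ij}, \dots, x_{ik_i}; \qquad i = 1, 2, \dots, n,$$
and let xij (i = 1, 2, ..., n; j = 1, 2, ..., ki) denote the measurement on the jth member in the ith family.
We shall have ki(ki - 1) pairs for the ith family or group like (xij, xil), j ≠ l. There will be
$$N = \sum_{i=1}^{n} k_i(k_i - 1)$$
entries for all the n families or groups. The table is symmetrical about the principal diagonal. Such a table is called an intra-class correlation table and the correlation is called intra-class correlation.
In the bivariate table xi1 occurs (ki - 1) times, xi2 occurs (ki - 1) times, ..., xiki occurs (ki - 1) times, i.e., from the ith family we have $(k_i - 1)\sum_j x_{ij}$, and hence for all the n families we have $\sum_i (k_i - 1)\sum_j x_{ij}$ as the marginal total, the table being symmetrical about the principal diagonal. Therefore
$$\bar{x} = \bar{y} = \frac{1}{N}\sum_{i=1}^{n}(k_i - 1)\sum_{j=1}^{k_i} x_{ij}.$$
Similarly,
$$\sigma_X^2 = \sigma_Y^2 = \frac{1}{N}\sum_{i=1}^{n}(k_i - 1)\sum_{j=1}^{k_i}(x_{ij} - \bar{x})^2.$$
Further,
$$\operatorname{Cov}(X, Y) = \frac{1}{N}\sum_{i}\sum_{j}\sum_{l \ne j}(x_{ij} - \bar{x})(x_{il} - \bar{x}) = \frac{1}{N}\sum_{i}\left[\Big(\sum_{j}(x_{ij} - \bar{x})\Big)^2 - \sum_{j}(x_{ij} - \bar{x})^2\right].$$
If we write
$$\bar{x}_i = \frac{1}{k_i}\sum_{j=1}^{k_i} x_{ij},$$
then $\sum_j (x_{ij} - \bar{x}) = k_i(\bar{x}_i - \bar{x})$ and
$$\operatorname{Cov}(X, Y) = \frac{1}{N}\left[\sum_{i} k_i^2(\bar{x}_i - \bar{x})^2 - \sum_{i}\sum_{j}(x_{ij} - \bar{x})^2\right].$$
Therefore the intra-class correlation coefficient is given by
$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i} k_i^2(\bar{x}_i - \bar{x})^2 - \sum_{i}\sum_{j}(x_{ij} - \bar{x})^2}{\sum_{i}(k_i - 1)\sum_{j}(x_{ij} - \bar{x})^2}.$$
If we put ki = k, i.e., if all families have an equal number of members, then
$$r = \frac{k^2\sum_{i}(\bar{x}_i - \bar{x})^2 - \sum_{i}\sum_{j}(x_{ij} - \bar{x})^2}{(k - 1)\sum_{i}\sum_{j}(x_{ij} - \bar{x})^2} = \frac{nk^2\sigma_m^2 - nk\sigma^2}{(k - 1)\,nk\,\sigma^2}$$
$$r = \frac{1}{k - 1}\left[\frac{k\sigma_m^2}{\sigma^2} - 1\right],$$
where σ² denotes the variance of X and σm² the variance of the means of the families.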
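For families of equal size k, the intra-class correlation can be computed in two ways: directly, as the product moment correlation over all ordered pairs of distinct members within each family, or from the closed form r = (kσm²/σ² - 1)/(k - 1). The sketch below compares the two on hypothetical measurements; the function names are illustrative and the closed form is assumed to be the standard equal-k result.

```python
import math
from itertools import permutations

def intraclass_by_pairs(families):
    """Pearson correlation over all ordered within-family pairs (x_ij, x_il), j != l."""
    xs, ys = [], []
    for family in families:
        for a, b in permutations(family, 2):
            xs.append(a)
            ys.append(b)
    n_pairs = len(xs)
    mean_x = sum(xs) / n_pairs
    mean_y = sum(ys) / n_pairs
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n_pairs
    var_x = sum((x - mean_x) ** 2 for x in xs) / n_pairs
    var_y = sum((y - mean_y) ** 2 for y in ys) / n_pairs
    return cov / math.sqrt(var_x * var_y)

def intraclass_equal_k(families):
    """Closed form for equal family size k: (k * var_means / var - 1) / (k - 1)."""
    k = len(families[0])
    if k < 2 or any(len(f) != k for f in families):
        raise ValueError("all families must have the same size k >= 2")
    n = len(families)
    values = [v for family in families for v in family]
    grand_mean = sum(values) / (n * k)
    var = sum((v - grand_mean) ** 2 for v in values) / (n * k)
    var_means = sum((sum(f) / k - grand_mean) ** 2 for f in families) / n
    return (k * var_means / var - 1) / (k - 1)

# Hypothetical heights of three brothers in each of three families.
families_demo = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
r_pairs = intraclass_by_pairs(families_demo)
r_closed = intraclass_equal_k(families_demo)
```

Here σ² = 60/9 and σm² = 6, so both routes give r = (2.7 - 1)/2 = 0.85.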
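8.10 BIVARIATE NORMAL DISTRIBUTION:
The bivariate normal distribution is a generalization of the normal distribution for a single variate. Let X and Y be two normally correlated variables with correlation coefficient ρ and E(X) = µ1, Var(X) = σ1²; E(Y) = µ2, Var(Y) = σ2². In deriving the bivariate normal distribution we make the following three assumptions.
(i) The regression of Y on X is linear. Since the mean of each array is on the line of regression $Y = \rho\frac{\sigma_2}{\sigma_1}X$, the mean or expected value of Y is $\rho\frac{\sigma_2}{\sigma_1}X$ for different values of X.
(ii) The arrays are homoscedastic, i.e., the variance in each array is the same. The common variance of estimate of Y in each array is then given by $\sigma_2^2(1 - \rho^2)$, ρ being the correlation coefficient between the variables X and Y, and is independent of X.
(iii) The distribution of Y in different arrays is normal.
Suppose that one of the variates, say X, is distributed normally with mean 0 and standard deviation σ1, so that the probability that a random value of X will fall in the small interval dx is
$$g(x)\,dx = \frac{1}{\sigma_1\sqrt{2\pi}}\exp\left(-\frac{x^2}{2\sigma_1^2}\right)dx.$$
The probability that a value of Y, taken at random in an assigned vertical array, will fall in the interval dy is
$$h(y \mid x)\,dy = \frac{1}{\sigma_2\sqrt{2\pi}\sqrt{1 - \rho^2}}\exp\left[-\frac{1}{2\sigma_2^2(1 - \rho^2)}\left(y - \rho\frac{\sigma_2}{\sigma_1}x\right)^2\right]dy.$$
The joint probability differential of X and Y is given by
$$dP(x, y) = g(x)\,h(y \mid x)\,dx\,dy$$
$$= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\exp\left[-\frac{1}{2(1 - \rho^2)}\left(\frac{x^2}{\sigma_1^2} - \frac{2\rho xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right)\right]dx\,dy.$$
Shifting the origin to (µ1, µ2), we get
$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\exp\left[-\frac{1}{2(1 - \rho^2)}\left\{\frac{(x - \mu_1)^2}{\sigma_1^2} - \frac{2\rho(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2} + \frac{(y - \mu_2)^2}{\sigma_2^2}\right\}\right],$$
$$-\infty < x < \infty, \quad -\infty < y < \infty,$$
where µ1, µ2, σ1 (> 0), σ2 (> 0) and ρ (-1 < ρ < 1) are the five parameters of the distribution.
This is the density function of a bivariate normal distribution. The variables X and Y are said to be normally correlated and the surface z = f(x, y) is known as the normal correlation surface. The nature of the normal correlation surface is indicated in the above diagram.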
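The density of the bivariate normal distribution with parameters µ1, µ2, σ1, σ2 and ρ can be evaluated directly. The sketch below is a plain transcription of the standard density into Python; the function names and the test point are made up. One simple sanity check is that for ρ = 0 the joint density factorises into the product of the two univariate normal densities.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def bivariate_normal_pdf(x, y, mu1, mu2, sigma1, sigma2, rho):
    """Bivariate normal density with five parameters mu1, mu2, sigma1, sigma2, rho."""
    if sigma1 <= 0 or sigma2 <= 0 or not -1 < rho < 1:
        raise ValueError("need sigma1 > 0, sigma2 > 0 and -1 < rho < 1")
    u = (x - mu1) / sigma1
    v = (y - mu2) / sigma2
    quad = (u * u - 2 * rho * u * v + v * v) / (1 - rho * rho)
    norm_const = 2 * math.pi * sigma1 * sigma2 * math.sqrt(1 - rho * rho)
    return math.exp(-0.5 * quad) / norm_const

# Hypothetical parameter values.
joint_uncorrelated = bivariate_normal_pdf(0.3, -1.2, 0.0, 1.0, 1.0, 2.0, 0.0)
product_of_marginals = normal_pdf(0.3, 0.0, 1.0) * normal_pdf(-1.2, 1.0, 2.0)
peak = bivariate_normal_pdf(0.0, 1.0, 0.0, 1.0, 1.0, 2.0, 0.5)
```

The value at the point of means, `peak`, is the height of the normal correlation surface at its summit, 1/(2πσ1σ2√(1 - ρ²)).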
8.11 MULTIPLE AND PARTIAL CORRELATION:
When the values of one variable are associated with or influenced by another variable, e.g., the ages of husband and wife, the heights of father and son, the supply and demand of a commodity and so on, Karl Pearson’s coefficient of correlation can be used as a measure of the linear relationship between them. But sometimes there is interrelation between many variables and the value of one variable may be influenced by many others, e.g., the yield of crop per acre, say (X1), depends upon the quality of seed (X2), fertility of soil (X3), fertilizer used (X4), irrigation facilities (X5), weather conditions (X6) and so on. Whenever we are interested in studying the joint effect of a group of variables upon a variable not included in that group, our study is that of multiple correlation and multiple regression.
Suppose in a trivariate or multivariate distribution we are interested in the relationship between two variables only. There are two alternatives, viz., (i) we consider only those members of the observed data in which the other variates have specified values, or (ii) we eliminate mathematically the effect of the other variates on the two variates. The first method has the disadvantage that it limits the size of the data, and it is applicable only to data in which the other variates have assigned values. In the second method it may not be possible to eliminate the entire influence of the variates, but the linear effect can be easily eliminated. The correlation and regression between only two variates, eliminating the linear effect of the other variates on them, are called the partial correlation and partial regression.
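8.12 PLANE OF REGRESSION:
The equation of the plane of regression of X1 on X2 and X3, the variables being measured from their respective means, is
$$X_1 = b_{12.3}X_2 + b_{13.2}X_3.$$
The constants b’s are determined by the principle of least squares, i.e., by minimizing the sum of the squares of the residuals, viz.,
$$S = \sum X_{1.23}^2 = \sum (X_1 - b_{12.3}X_2 - b_{13.2}X_3)^2,$$
the summation being extended to the given values (N in number) of the variables.
The normal equations for estimating $b_{12.3}$ and $b_{13.2}$ are
$$\sum X_2(X_1 - b_{12.3}X_2 - b_{13.2}X_3) = 0$$
and
$$\sum X_3(X_1 - b_{12.3}X_2 - b_{13.2}X_3) = 0.$$
Since the Xi’s are measured from their respective means, we have
$$\sigma_i^2 = \frac{1}{N}\sum X_i^2, \qquad \operatorname{Cov}(X_i, X_j) = \frac{1}{N}\sum X_iX_j = r_{ij}\sigma_i\sigma_j.$$
Hence we get
$$r_{12}\sigma_1\sigma_2 - b_{12.3}\sigma_2^2 - b_{13.2}\,r_{23}\sigma_2\sigma_3 = 0,$$
$$r_{13}\sigma_1\sigma_3 - b_{12.3}\,r_{23}\sigma_2\sigma_3 - b_{13.2}\sigma_3^2 = 0.$$
Solving the equations for b12.3 and b13.2, we get
$$b_{12.3} = \frac{\sigma_1}{\sigma_2}\cdot\frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2}.$$
Similarly, we will get
$$b_{13.2} = \frac{\sigma_1}{\sigma_3}\cdot\frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2}.$$
If we write
$$\omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}$$
and ωij is the cofactor of the element in the ith row and jth column of ω, then
$$b_{12.3} = -\frac{\sigma_1}{\sigma_2}\cdot\frac{\omega_{12}}{\omega_{11}}, \qquad b_{13.2} = -\frac{\sigma_1}{\sigma_3}\cdot\frac{\omega_{13}}{\omega_{11}},$$
and the equation of the plane of regression of X1 on X2 and X3 becomes
$$\frac{X_1}{\sigma_1}\omega_{11} + \frac{X_2}{\sigma_2}\omega_{12} + \frac{X_3}{\sigma_3}\omega_{13} = 0.$$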
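The closed-form coefficients of the plane of regression can be checked against the normal equations. The sketch below takes made-up data, measures each variable from its mean, computes b12.3 = (σ1/σ2)(r12 - r13 r23)/(1 - r23²) and b13.2 = (σ1/σ3)(r13 - r12 r23)/(1 - r23²), and then confirms that the residuals are orthogonal to X2 and X3. All names and values are hypothetical.

```python
import math

def center(values):
    """Measure a variable from its mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def sd(dev):
    """Standard deviation of deviations from the mean (divisor N)."""
    return math.sqrt(sum(d * d for d in dev) / len(dev))

def corr(dev_a, dev_b):
    """Correlation of two variables already measured from their means."""
    cov = sum(a * b for a, b in zip(dev_a, dev_b)) / len(dev_a)
    return cov / (sd(dev_a) * sd(dev_b))

def plane_coefficients(x1, x2, x3):
    """Return (b12.3, b13.2) for deviations x1, x2, x3."""
    s1, s2, s3 = sd(x1), sd(x2), sd(x3)
    r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
    denom = 1 - r23 ** 2
    b12_3 = (s1 / s2) * (r12 - r13 * r23) / denom
    b13_2 = (s1 / s3) * (r13 - r12 * r23) / denom
    return b12_3, b13_2

# Hypothetical observations on three variables.
x1 = center([3, 4, 8, 7, 12, 10])
x2 = center([1, 2, 3, 4, 5, 6])
x3 = center([2, 1, 4, 3, 6, 5])
b12_3, b13_2 = plane_coefficients(x1, x2, x3)

# Residuals X1.23 and the two normal equations: both sums should vanish.
residuals = [a - b12_3 * b - b13_2 * c for a, b, c in zip(x1, x2, x3)]
normal_eq_2 = sum(b * e for b, e in zip(x2, residuals))
normal_eq_3 = sum(c * e for c, e in zip(x3, residuals))
```

Both `normal_eq_2` and `normal_eq_3` come out as zero up to rounding error, which is what the least-squares conditions require.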
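8.13 PROPERTIES OF RESIDUALS:
Throughout, $X_{1.23} = X_1 - b_{12.3}X_2 - b_{13.2}X_3$ denotes the residual of X1 after fitting the plane of regression on X2 and X3, and $X_{1.2} = X_1 - b_{12}X_2$, $X_{3.2} = X_3 - b_{32}X_2$ denote residuals of order one.
Property 1:
The sum of the product of any residual of order zero with any other residual of higher order is zero, provided the subscript of the former occurs among the secondary subscripts of the latter.
The normal equations for estimating the b’s in trivariate and n-variate distributions are
$$\sum X_2X_{1.23} = 0, \quad \sum X_3X_{1.23} = 0 \qquad\text{and}\qquad \sum X_iX_{1.23\ldots n} = 0 \quad (i = 2, 3, \ldots, n)$$
respectively. Here Xi (i = 1, 2, 3, ..., n) can be regarded as a residual of order zero. Hence the result.
Property 2:
The sum of the product of any two residuals in which all the secondary subscripts of the first occur among the secondary subscripts of the second is unaltered if we omit any or all of the secondary subscripts of the first. Conversely, the product sum of any residual of order ‘p’ with a residual of order p+q, the ‘p’ subscripts being the same in each case, is unaltered by adding to the secondary subscripts of the former any or all of the ‘q’ additional subscripts of the latter.
Let us consider
$$\sum X_{1.2}X_{1.23} = \sum (X_1 - b_{12}X_2)X_{1.23}$$
$$= \sum X_1X_{1.23} - b_{12}\sum X_2X_{1.23}$$
$$= \sum X_1X_{1.23} \qquad \text{(by Property 1)}.$$
Again
$$\sum X_{1.23}^2 = \sum X_{1.23}(X_1 - b_{12.3}X_2 - b_{13.2}X_3) = \sum X_1X_{1.23}.$$
Hence Property 2.
Property 3:
The sum of the product of two residuals is zero if all the subscripts of the one occur among the secondary subscripts of the other, e.g.,
$$\sum X_{3.2}X_{1.23} = \sum (X_3 - b_{32}X_2)X_{1.23}$$
$$= \sum X_3X_{1.23} - b_{32}\sum X_2X_{1.23}$$
$$= 0 \qquad \text{(by Property 1)}.$$
Hence Property 3.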
8.14. COEFFICIENT OF MULTIPLE CORRELATION:
In a trivariate distribution in which each of the variables X1, X2 and X3 has N observations, the multiple correlation coefficient of X1 on X2 and X3, usually denoted by R1.23, is the simple correlation coefficient between X1 and the joint effect of X2 and X3 on X1. In other words, R1.23 is the correlation coefficient between X1 and its estimated value as given by the plane of regression of X1 on X2 and X3.
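The estimated value is
$$e_{1.23} = b_{12.3}X_2 + b_{13.2}X_3,$$
so that
$$X_{1.23} = X_1 - b_{12.3}X_2 - b_{13.2}X_3 = X_1 - e_{1.23} \quad\Rightarrow\quad e_{1.23} = X_1 - X_{1.23}.$$
Since the Xi’s are measured from their respective means, we have
E(X1.23) = 0 and E(e1.23) = 0.
By definition,
$$R_{1.23} = \frac{\operatorname{Cov}(X_1, e_{1.23})}{\sqrt{V(X_1)\,V(e_{1.23})}}.$$
Writing $\sigma_{1.23}^2 = \frac{1}{N}\sum X_{1.23}^2 = \frac{1}{N}\sum X_1X_{1.23}$ for the variance of the residual,
$$\operatorname{Cov}(X_1, e_{1.23}) = \frac{1}{N}\sum X_1e_{1.23} = \frac{1}{N}\sum X_1(X_1 - X_{1.23})$$
$$= \frac{1}{N}\sum X_1^2 - \frac{1}{N}\sum X_1X_{1.23} = \sigma_1^2 - \sigma_{1.23}^2.$$
Also
$$V(e_{1.23}) = \frac{1}{N}\sum e_{1.23}^2 = \frac{1}{N}\sum (X_1 - X_{1.23})^2$$
$$= \frac{1}{N}\sum X_1^2 + \frac{1}{N}\sum X_{1.23}^2 - \frac{2}{N}\sum X_1X_{1.23}$$
$$= \sigma_1^2 + \sigma_{1.23}^2 - 2\sigma_{1.23}^2 = \sigma_1^2 - \sigma_{1.23}^2.$$
Therefore
$$R_{1.23} = \frac{\sigma_1^2 - \sigma_{1.23}^2}{\sqrt{\sigma_1^2(\sigma_1^2 - \sigma_{1.23}^2)}} \quad\Rightarrow\quad R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2}.$$
Now, with ω the determinant of the correlation matrix of X1, X2, X3 and ωij the cofactor of its (i, j)th element, as in Section 8.12,
$$\sigma_{1.23}^2 = \frac{1}{N}\sum X_1(X_1 - b_{12.3}X_2 - b_{13.2}X_3) = \sigma_1^2 - b_{12.3}\,r_{12}\sigma_1\sigma_2 - b_{13.2}\,r_{13}\sigma_1\sigma_3$$
$$= \frac{\sigma_1^2}{\omega_{11}}(\omega_{11} + r_{12}\omega_{12} + r_{13}\omega_{13}) = \sigma_1^2\,\frac{\omega}{\omega_{11}}.$$
Hence
$$1 - R_{1.23}^2 = \frac{\omega}{\omega_{11}}, \qquad \omega = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}, \qquad \omega_{11} = 1 - r_{23}^2,$$
$$R_{1.23}^2 = 1 - \frac{\omega}{\omega_{11}}$$
$$= \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}.$$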
This formula
expresses the multiple correlation coefficient in terms of the total
correlation coefficients between the pairs of variables.
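As a quick numerical sketch, the multiple correlation coefficient can be computed from the three total correlation coefficients using R²1.23 = (r12² + r13² - 2 r12 r13 r23)/(1 - r23²), and cross-checked against the determinant form 1 - ω/ω11. The function names and the correlation values below are made up for illustration.

```python
import math

def multiple_correlation(r12, r13, r23):
    """R_1.23 from the total correlations, assuming |r23| < 1."""
    if abs(r23) >= 1:
        raise ValueError("need |r23| < 1")
    r_squared = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)
    return math.sqrt(r_squared)

def multiple_correlation_det(r12, r13, r23):
    """Same quantity via 1 - omega / omega_11 (correlation determinant form)."""
    omega = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    omega_11 = 1 - r23 ** 2
    return math.sqrt(1 - omega / omega_11)

# Hypothetical total correlations.
R_demo = multiple_correlation(0.6, 0.5, 0.4)
R_det_demo = multiple_correlation_det(0.6, 0.5, 0.4)
```

For these values R² = 0.37/0.84, so R is about 0.664, which is not less than either |r12| = 0.6 or |r13| = 0.5.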
8.14.1 Properties of multiple correlation coefficient:
1. The multiple correlation coefficient measures the closeness of the association between the observed values and the expected values of a variable obtained from the multiple linear regression of that variable on the other variables.
2. The multiple correlation coefficient between observed values and expected values, when the expected values are calculated from a linear relation of the variables determined by the method of least squares, is never less than that obtained when the expected values are calculated from any other linear combination of the variables.
3. Since R1.23 is the simple correlation between X1 and e1.23, it must lie between -1 and 1. But as seen above, R1.23 is a non-negative quantity, and we conclude that
0 ≤ R1.23 ≤ 1
4. If R1.23 = 1, then the association is perfect and all the regression residuals are zero, and as such $\sigma_{1.23}^2 = 0$. In this case, since X1 = e1.23, the predicted value of X1, the multiple linear regression equation of X1 on X2 and X3 may be said to be a perfect prediction formula.
5. If R1.23 = 0, then all total and partial correlations involving X1 are zero. So X1 is completely uncorrelated with all the other variables in this case, and the multiple regression equation fails to throw any light on the value of X1 when X2 and X3 are known.
6. R1.23 is not less than any total correlation coefficient of X1 with the other variables, i.e.,
$$R_{1.23} \ge |r_{12}|, \qquad R_{1.23} \ge |r_{13}|.$$
8.15. COEFFICIENT OF PARTIAL CORRELATION:
Sometimes the correlation between two variables X1 and X2 may be partly due to the correlation of a third variable, X3, with both X1 and X2. In such a situation, one may want to know what the correlation between X1 and X2 would be if the effect of X3 on each of X1 and X2 were eliminated. This correlation is called the partial correlation, and the correlation coefficient between X1 and X2 after the linear effect of X3 on each of them has been eliminated is called the partial correlation coefficient.
The residual X1.3 = X1 - b13 X3 may be regarded as that part of the variable X1 which remains after the linear effect of X3 has been eliminated. Similarly, the residual X2.3 = X2 - b23 X3 may be interpreted as the part of the variable X2 obtained after eliminating the linear effect of X3. Thus the partial correlation coefficient between X1 and X2, usually denoted by r12.3, is given by
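$$r_{12.3} = \frac{\operatorname{Cov}(X_{1.3}, X_{2.3})}{\sqrt{\operatorname{Var}(X_{1.3})\operatorname{Var}(X_{2.3})}}.$$
We have
$$\operatorname{Cov}(X_{1.3}, X_{2.3}) = \frac{1}{N}\sum X_{1.3}X_{2.3} = \frac{1}{N}\sum X_1X_{2.3} = \frac{1}{N}\sum X_1(X_2 - b_{23}X_3)$$
$$= r_{12}\sigma_1\sigma_2 - b_{23}\,r_{13}\sigma_1\sigma_3 = r_{12}\sigma_1\sigma_2 - r_{23}\frac{\sigma_2}{\sigma_3}\,r_{13}\sigma_1\sigma_3$$
$$= \sigma_1\sigma_2(r_{12} - r_{13}r_{23}).$$
Also
$$\operatorname{Var}(X_{1.3}) = \frac{1}{N}\sum X_{1.3}^2 = \frac{1}{N}\sum X_1X_{1.3}$$
$$= \frac{1}{N}\sum X_1(X_1 - b_{13}X_3)$$
$$= \sigma_1^2 - b_{13}\,r_{13}\sigma_1\sigma_3 = \sigma_1^2 - r_{13}\frac{\sigma_1}{\sigma_3}\,r_{13}\sigma_1\sigma_3$$
$$= \sigma_1^2(1 - r_{13}^2).$$
Similarly we shall get
$$\operatorname{Var}(X_{2.3}) = \sigma_2^2(1 - r_{23}^2).$$
Hence
$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}.$$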
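The partial correlation coefficient can be computed either from the formula r12.3 = (r12 - r13 r23)/√((1 - r13²)(1 - r23²)) or directly, by correlating the residuals X1.3 and X2.3. The sketch below does both on hypothetical data and compares the results; all function names and values are illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation with divisor-N moments."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n
    var_a = sum((p - mean_a) ** 2 for p in a) / n
    var_b = sum((q - mean_b) ** 2 for q in b) / n
    return cov / math.sqrt(var_a * var_b)

def partial_corr_formula(r12, r13, r23):
    """r_12.3 from the three total correlations."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

def residual_on(target, regressor):
    """Residual of target after removing its linear regression on regressor."""
    n = len(target)
    mean_t = sum(target) / n
    mean_r = sum(regressor) / n
    slope = (sum((t - mean_t) * (r - mean_r) for t, r in zip(target, regressor))
             / sum((r - mean_r) ** 2 for r in regressor))
    return [(t - mean_t) - slope * (r - mean_r) for t, r in zip(target, regressor)]

# Hypothetical observations on three variables.
x1 = [3, 4, 8, 7, 12, 10]
x2 = [1, 2, 3, 4, 5, 6]
x3 = [2, 1, 4, 3, 6, 5]

from_formula = partial_corr_formula(pearson(x1, x2), pearson(x1, x3), pearson(x2, x3))
from_residuals = pearson(residual_on(x1, x3), residual_on(x2, x3))
```

The two values agree up to rounding error, which illustrates that r12.3 is the ordinary correlation between the parts of X1 and X2 left over after the linear effect of X3 is removed.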

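8.16. MULTIPLE CORRELATION IN TERMS OF TOTAL AND PARTIAL CORRELATION:
$$1 - R_{1.23}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)$$
Proof.
We have
$$R_{1.23}^2 = \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}$$
$$\Rightarrow\quad 1 - R_{1.23}^2 = \frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}.$$
Also
$$1 - r_{13.2}^2 = 1 - \frac{(r_{13} - r_{12}r_{23})^2}{(1 - r_{12}^2)(1 - r_{23}^2)} = \frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}}{(1 - r_{12}^2)(1 - r_{23}^2)}.$$
Multiplying the last expression by (1 - r12²) gives the expression for 1 - R²1.23. Hence the result.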
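8.17. EXPRESSION FOR REGRESSION COEFFICIENT IN TERMS OF REGRESSION COEFFICIENTS OF LOWER ORDER:
$$b_{12.3} = \frac{b_{12} - b_{13}b_{32}}{1 - b_{23}b_{32}}$$
Proof. Write $X_{1.23} = X_1 - b_{12.3}X_2 - b_{13.2}X_3$ and $X_{2.3} = X_2 - b_{23}X_3$, all variables being measured from their means. By the properties of residuals, $\sum X_{2.3}X_{1.23} = 0$ and $\sum X_3X_{2.3} = 0$, so that
$$0 = \sum X_{2.3}(X_1 - b_{12.3}X_2 - b_{13.2}X_3) = \sum X_1X_{2.3} - b_{12.3}\sum X_2X_{2.3}$$
$$\Rightarrow\quad b_{12.3}\sum X_2(X_2 - b_{23}X_3) = \sum X_1(X_2 - b_{23}X_3)$$
$$\Rightarrow\quad b_{12.3}\left(\sum X_2^2 - b_{23}\sum X_2X_3\right) = \sum X_1X_2 - b_{23}\sum X_1X_3.$$
Dividing both sides by N, the total number of observations, we get
$$b_{12.3}\left(\sigma_2^2 - b_{23}\,r_{23}\sigma_2\sigma_3\right) = r_{12}\sigma_1\sigma_2 - b_{23}\,r_{13}\sigma_1\sigma_3.$$
In the case of two variables we have
$$b_{ij} = r_{ij}\frac{\sigma_i}{\sigma_j}, \qquad\text{so that}\qquad r_{ij}\sigma_i\sigma_j = b_{ij}\sigma_j^2 = b_{ji}\sigma_i^2.$$
Hence we get
$$b_{12.3}\left(\sigma_2^2 - b_{23}b_{32}\sigma_2^2\right) = b_{12}\sigma_2^2 - b_{23}b_{13}\sigma_3^2,$$
and, since $b_{23}\sigma_3^2 = r_{23}\sigma_2\sigma_3 = b_{32}\sigma_2^2$,
$$b_{12.3}\,\sigma_2^2(1 - b_{23}b_{32}) = \sigma_2^2(b_{12} - b_{13}b_{32})$$
$$\Rightarrow\quad b_{12.3} = \frac{b_{12} - b_{13}b_{32}}{1 - b_{23}b_{32}}.$$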
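8.18. EXPRESSION FOR PARTIAL CORRELATION COEFFICIENT IN TERMS OF CORRELATION COEFFICIENT OF LOWER ORDER:
$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$
Proof. The regression coefficients of X1.3 on X2.3 and of X2.3 on X1.3 are b12.3 and b21.3 respectively. Since the correlation coefficient is the geometric mean of the regression coefficients,
$$r_{12.3}^2 = b_{12.3}\,b_{21.3}.$$
From the plane of regression of X1 on X2 and X3,
$$b_{12.3} = \frac{\sigma_1}{\sigma_2}\cdot\frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2}.$$
We get, on interchanging the subscripts 1 and 2,
$$b_{21.3} = \frac{\sigma_2}{\sigma_1}\cdot\frac{r_{12} - r_{13}r_{23}}{1 - r_{13}^2}.$$
Hence we get
$$r_{12.3}^2 = \frac{(r_{12} - r_{13}r_{23})^2}{(1 - r_{13}^2)(1 - r_{23}^2)},$$
the sign of r12.3 being the same as that of b12.3.
Also, a partial correlation coefficient of any order can be expressed in the same way in terms of those of the next lower order, for example
$$r_{12.34} = \frac{r_{12.3} - r_{14.3}\,r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}.$$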