Correlation and Regression

8. CORRELATION AND REGRESSION

 

8.1 BIVARIATE DISTRIBUTION:

In a bivariate distribution we may be interested in finding out whether there is any correlation or covariation between the two variables under study. If a change in one variable is accompanied by a change in the other, the variables are said to be correlated. If the two variables deviate in the same direction, that is, if an increase in one results in a corresponding increase in the other, the correlation is said to be direct or positive. But if they constantly deviate in opposite directions, that is, if an increase in one results in a corresponding decrease in the other, the correlation is said to be inverse or negative.

 

8.2 SCATTER DIAGRAM:

It is the simplest diagrammatic representation of bivariate data. For the bivariate distribution (xi, yi); i = 1, 2, …, n, if the values of the variables X and Y are plotted along the x-axis and y-axis respectively in the xy-plane, the diagram of dots so obtained is known as a scatter diagram. From the scatter diagram we can form a fairly good, though vague, idea of whether the variables are correlated: if the points are very dense, i.e., very close to each other, we should expect a fairly good amount of correlation between the variables, and if the points are widely scattered, a poor correlation is expected. This method, however, is not suitable if the number of observations is fairly large.

 

8.3 KARL PEARSON COEFFICIENT OF CORRELATION:

As a measure of the intensity or degree of linear relationship between two variables, Karl Pearson, a British biometrician, developed a formula called the correlation coefficient.

The correlation coefficient between two random variables X and Y, usually denoted by r(X, Y) or simply rXY, is a numerical measure of the linear relationship between them and is defined as

$$ r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y} $$
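As a rough numerical sketch of the definition r = Cov(X, Y)/(σX σY), the Python function below computes the coefficient from paired observations. The function name pearson_r, the use of divisor n (rather than n − 1) and the sample figures are assumptions made only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation r = Cov(X, Y) / (sigma_X * sigma_Y), with divisor n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired observations, not taken from the text.
x_values = [65, 66, 67, 67, 68, 69, 70, 72]
y_values = [67, 68, 65, 68, 72, 72, 69, 71]
r = pearson_r(x_values, y_values)
print(round(r, 4))
```

Because the same divisor appears in the covariance and in both variances, it cancels, so the choice between n and n − 1 does not change r.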

                               

 

8.4 CALCULATION OF THE CORRELATION COEFFICIENT FOR A BIVARIATE FREQUENCY DISTRIBUTION:

When the data are considerably large, they may be summarized by using a two-way table. Hence for each variable a suitable number of classes are taken, keeping in view the same considerations as in the univariate case. If there are n classes for X and m classes for Y, there will be in all m*n cells in the two-way table. By going through the pairs of values of X and Y, we can find the frequency for each cell. The whole set of cell frequencies will then define a bivariate frequency distribution. The column totals and row totals will give us the marginal distributions of X and Y. A particular column or row will be called the conditional distribution of Y for given X or of X for given Y respectively.

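Suppose that the bivariate data on X and Y are presented in a two-way correlation table in which the m classes of Y are placed along the horizontal lines (rows) and the n classes of X along the vertical lines (columns), and fij is the frequency of individuals lying in the (i, j)th cell, i.e., in the ith class of X and the jth class of Y.

Here

$$ \sum_{i=1}^{n} f_{ij} = f_{\cdot j} $$

is the sum of the frequencies along any row and

$$ \sum_{j=1}^{m} f_{ij} = f_{i \cdot} $$

is the sum of the frequencies along any column. We observe that

$$ \sum_{i} f_{i \cdot} = \sum_{j} f_{\cdot j} = \sum_{i} \sum_{j} f_{ij} = N $$

Then, taking xi and yj as the mid-values of the classes,

$$ \bar{x} = \frac{1}{N} \sum_{i} x_i f_{i \cdot}, \qquad \bar{y} = \frac{1}{N} \sum_{j} y_j f_{\cdot j} $$

$$ \sigma_X^2 = \frac{1}{N} \sum_{i} x_i^2 f_{i \cdot} - \bar{x}^2, \qquad \sigma_Y^2 = \frac{1}{N} \sum_{j} y_j^2 f_{\cdot j} - \bar{y}^2 $$

$$ \operatorname{Cov}(X, Y) = \frac{1}{N} \sum_{i} \sum_{j} x_i y_j f_{ij} - \bar{x}\,\bar{y}, \qquad r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} $$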
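The calculation for a two-way table can be sketched in Python as follows. This is only an illustration under stated assumptions: each class is represented by its mid-value, the table is stored row by row with one row per class of Y and one column per class of X, and the function name table_correlation and the small table are hypothetical.

```python
import math

def table_correlation(x_mid, y_mid, freq):
    """Correlation coefficient from a two-way frequency table.

    freq[j][i] is the frequency of the cell for X class i and Y class j,
    so each inner list is one row (one class of Y) of the table.
    """
    n_total = sum(sum(row) for row in freq)
    col_totals = [sum(row[i] for row in freq) for i in range(len(x_mid))]  # marginal of X
    row_totals = [sum(row) for row in freq]                                # marginal of Y

    mean_x = sum(x * f for x, f in zip(x_mid, col_totals)) / n_total
    mean_y = sum(y * f for y, f in zip(y_mid, row_totals)) / n_total
    var_x = sum(x * x * f for x, f in zip(x_mid, col_totals)) / n_total - mean_x ** 2
    var_y = sum(y * y * f for y, f in zip(y_mid, row_totals)) / n_total - mean_y ** 2
    cov = sum(
        x_mid[i] * y_mid[j] * freq[j][i]
        for j in range(len(y_mid))
        for i in range(len(x_mid))
    ) / n_total - mean_x * mean_y
    return cov / math.sqrt(var_x * var_y)

# Hypothetical 3 x 3 table: rows are Y classes, columns are X classes.
x_mid = [1, 2, 3]
y_mid = [10, 20, 30]
freq = [
    [4, 1, 0],
    [1, 3, 1],
    [0, 1, 4],
]
r_table = table_correlation(x_mid, y_mid, freq)
```

The marginal totals are computed once and reused for the means and variances, which mirrors how the calculation is usually laid out by hand.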

                               

           

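8.5 PROBABLE ERROR OF CORRELATION COEFFICIENT:

If r is the correlation coefficient in a sample of n pairs of observations, then its standard error is given by

$$ \text{S.E.}(r) = \frac{1 - r^2}{\sqrt{n}} $$

The probable error of the correlation coefficient is given by

$$ \text{P.E.}(r) = 0.6745 \times \text{S.E.}(r) = 0.6745 \, \frac{1 - r^2}{\sqrt{n}} $$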

                               

Probable error is an old measure for testing the reliability of an observed correlation coefficient. The reason for taking the factor 0.6745 is that in a normal distribution the range µ ± 0.6745σ covers 50% of the total area. According to Secrist, “the probable error of the correlation coefficient is an amount which if added to and subtracted from the mean correlation coefficient produces amounts within which the chances are even that a coefficient of correlation from a series selected at random will fall.”

If r < P.E.(r), the correlation is not at all significant. If r > 6 P.E.(r), it is definitely significant. A rigorous method of testing the significance of an observed correlation coefficient will be discussed later under “tests of significance” in sampling.

Probable error also enables us to find the limits within which the population correlation can be expected to vary. The limits are r±P.E.(r).
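A small Python sketch of these rules, using S.E.(r) = (1 − r²)/√n and P.E.(r) = 0.6745 × S.E.(r). The function names, the use of |r| in the comparison (so that negative correlations are handled the same way) and the label "inconclusive" for the range between the two thresholds are assumptions of this sketch, not part of the text.

```python
import math

def standard_error_r(r, n):
    """Standard error of r for a sample of n pairs: (1 - r^2) / sqrt(n)."""
    return (1 - r * r) / math.sqrt(n)

def probable_error_r(r, n):
    """Probable error of r: 0.6745 times the standard error."""
    return 0.6745 * standard_error_r(r, n)

def judge_by_probable_error(r, n):
    """Apply the r < P.E. and r > 6 P.E. rules of thumb (on |r|, an assumption)."""
    pe = probable_error_r(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"

def population_limits(r, n):
    """Limits r - P.E.(r) and r + P.E.(r) for the population correlation."""
    pe = probable_error_r(r, n)
    return r - pe, r + pe

# Hypothetical sample: r = 0.8 from n = 64 pairs.
print(probable_error_r(0.8, 64))
print(judge_by_probable_error(0.8, 64))
print(population_limits(0.8, 64))
```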

8.6 RANK CORRELATION:

Let us suppose that a group of n individuals is arranged in order of merit or proficiency in possession of two characteristics A and B. These ranks in the two characteristics will, in general, be different. For example, if we consider the relation between intelligence and beauty, it is not necessary that a beautiful individual is intelligent also. Let (xi, yi); i = 1, 2, …, n be the ranks of the ith individual in the two characteristics A and B respectively. The Pearsonian coefficient of correlation between the ranks xi and yi is called the rank correlation coefficient between A and B for that group of individuals.

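Assuming that no two individuals are bracketed equal in either classification, each of the variables X and Y takes the values 1, 2, …, n.

Hence

$$ \bar{x} = \bar{y} = \frac{1}{n}(1 + 2 + \dots + n) = \frac{n + 1}{2}, \qquad \sigma_X^2 = \sigma_Y^2 = \frac{1}{n} \sum_{i=1}^{n} i^2 - \left( \frac{n + 1}{2} \right)^2 = \frac{n^2 - 1}{12} $$

In general xi ≠ yi. Let di = xi − yi, so that

$$ d_i = (x_i - \bar{x}) - (y_i - \bar{y}) $$

Squaring and summing over i from 1 to n, we get

$$ \sum d_i^2 = \sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2 - 2 \sum (x_i - \bar{x})(y_i - \bar{y}) $$

Dividing both sides by n, we get

$$ \frac{1}{n} \sum d_i^2 = \sigma_X^2 + \sigma_Y^2 - 2 \operatorname{Cov}(X, Y) = \sigma_X^2 + \sigma_Y^2 - 2 \rho \, \sigma_X \sigma_Y $$

where ρ is the rank correlation coefficient between A and B. Since σX² = σY², this gives

$$ \rho = 1 - \frac{\sum d_i^2}{2 n \sigma_X^2} = 1 - \frac{6 \sum d_i^2}{n (n^2 - 1)} $$

which is Spearman's formula for the rank correlation coefficient.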
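Spearman's formula ρ = 1 − 6Σd²/(n(n² − 1)) can be tried out with a few lines of Python. The sketch assumes untied ranks 1, 2, …, n in both characteristics, as in the derivation; the function name and the two rank lists are hypothetical.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for untied ranks 1..n in both lists."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Hypothetical ranks of five individuals in characteristics A and B.
ranks_a = [1, 2, 3, 4, 5]
ranks_b = [2, 1, 4, 3, 5]
rho = spearman_rho(ranks_a, ranks_b)
print(rho)
```

For these ranks Σd² = 4, so the formula gives 1 − 24/120 = 0.8, which is also the ordinary Pearson coefficient computed on the same ranks.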

 

 

8.7 REGRESSION:

The term “regression” literally means “stepping back towards the average”. It was first used by the British biometrician Sir Francis Galton in connection with the inheritance of stature. Galton found that the offspring of abnormally tall or short parents tend to “regress” or “step back” to the average population height. But the term “regression” as now used in statistics is only a convenient term without any reference to biometry.

Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data.

In regression analysis there are two types of variables. The variable whose value is influenced or is to be predicted is called the dependent variable, and the variable which influences the values or is used for prediction is called the independent variable. In regression analysis the independent variable is also known as the regressor, predictor or explanatory variable, while the dependent variable is also known as the regressed or explained variable.

 

8.7.1 Lines of regression:

If the variables in a bivariate distribution are related, we will find that the points in the scatter diagram cluster round some curve, called the “curve of regression”. If the curve is a straight line, it is called the line of regression and there is said to be linear regression between the variables; otherwise the regression is said to be curvilinear.

The line of regression is the line which gives the best estimate to the value of one variable for any specific value of the other variable. Thus the line of regression is the line of “best fit” and is obtained by the principle of least squares.

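Let us suppose that in the bivariate distribution (xi, yi); i = 1, 2, …, n; Y is the dependent variable and X is the independent variable. Let the line of regression of Y on X be Y = a + bX.

According to the principle of least squares, the normal equations for estimating a and b are

$$ \sum y_i = n a + b \sum x_i \qquad (1) $$

and

$$ \sum x_i y_i = a \sum x_i + b \sum x_i^2 \qquad (2) $$

From (1), on dividing by n, we get

$$ \bar{y} = a + b \bar{x} \qquad (a) $$

Thus the line of regression of Y on X passes through the point (x̄, ȳ).

Now

$$ \mu_{11} = \operatorname{Cov}(X, Y) = \frac{1}{n} \sum x_i y_i - \bar{x}\,\bar{y} \;\Rightarrow\; \frac{1}{n} \sum x_i y_i = \mu_{11} + \bar{x}\,\bar{y} \qquad (3) $$

Also

$$ \sigma_X^2 = \frac{1}{n} \sum x_i^2 - \bar{x}^2 \;\Rightarrow\; \frac{1}{n} \sum x_i^2 = \sigma_X^2 + \bar{x}^2 \qquad (4) $$

Dividing (2) by n and using (3) and (4), we get

$$ \mu_{11} + \bar{x}\,\bar{y} = a \bar{x} + b (\sigma_X^2 + \bar{x}^2) \qquad (5) $$

Multiplying (a) by x̄ and then subtracting from (5), we get

$$ \mu_{11} = b \sigma_X^2 \;\Rightarrow\; b = \frac{\mu_{11}}{\sigma_X^2} $$

Since b is the slope of the line of regression of Y on X and since the line of regression passes through the point (x̄, ȳ), its equation is

$$ Y - \bar{y} = \frac{\mu_{11}}{\sigma_X^2} (X - \bar{x}) \qquad (6) $$

$$ Y - \bar{y} = r \frac{\sigma_Y}{\sigma_X} (X - \bar{x}) \qquad (7) $$

Starting with the equation X = A + BY and proceeding similarly, or by simply interchanging the variables X and Y in (6) and (7), the equation of the line of regression of X on Y becomes

$$ X - \bar{x} = \frac{\mu_{11}}{\sigma_Y^2} (Y - \bar{y}) = r \frac{\sigma_X}{\sigma_Y} (Y - \bar{y}) $$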
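The two lines of regression can be computed directly from the moments: the slope of Y on X is μ11/σX², the slope of X on Y is μ11/σY², and both lines pass through (x̄, ȳ). The Python sketch below follows that recipe; the function names and the five data points are assumptions chosen for illustration.

```python
def regression_lines(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) for the two least-squares lines."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return mean_x, mean_y, mu11 / var_x, mu11 / var_y

def predict_y(x, mean_x, mean_y, b_yx):
    """Estimate Y from X using the line of regression of Y on X."""
    return mean_y + b_yx * (x - mean_x)

def predict_x(y, mean_x, mean_y, b_xy):
    """Estimate X from Y using the line of regression of X on Y."""
    return mean_x + b_xy * (y - mean_y)

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
mean_x, mean_y, b_yx, b_xy = regression_lines(xs, ys)
print(b_yx, b_xy)
print(predict_y(6, mean_x, mean_y, b_yx))
```

Note that the two lines are different unless the correlation is perfect, so the line of Y on X should be used to estimate Y and the line of X on Y to estimate X.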

8.7.2 Regression Curves:

In modern terminology, the conditional mean E(Y|X=x) for a continuous distribution is called the regression function of Y on X, and the graph of this function of x is known as the regression curve of Y on X or sometimes the regression curve for the mean of Y. Geometrically, the regression function represents the y-coordinate of the centre of mass of the bivariate probability mass in the infinitesimal vertical strip bounded by x and x+dx.

Similarly, the regression function of X on Y is E(X|Y=y) and the graph of this function of y is called the regression curve of X on Y.

In case a regression curve is a straight line, the corresponding regression is said to be linear. If one of the regressions is linear, it does not however follow that the other is also linear.

 

8.7.3 Regression coefficients:

‘b’, the slope of the line of regression of Y on X, is also called the coefficient of regression of Y on X. It represents the increment in the value of the dependent variable Y corresponding to a unit change in the value of the independent variable X. More precisely, we write

bYX = Regression coefficient of Y on X = μ11/σX² = r (σY/σX)

Similarly, the coefficient of regression of X on Y indicates the change in the value of variable X corresponding to a unit change in the value of variable Y and is given by

bXY = Regression coefficient of X on Y = μ11/σY² = r (σX/σY)

 

8.7.4 Properties of Regression Coefficients:

(a) The correlation coefficient is the geometric mean of the regression coefficients, i.e., r² = bYX bXY, with r taking the common sign of the two coefficients.

(b) If one of the regression coefficients is greater than unity, the other must be less than unity.

(c) The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient r, provided r > 0.

(d) Regression coefficients are independent of a change of origin but not of scale.
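These properties are easy to check numerically. The Python sketch below does so on a small hypothetical data set; the helper regression_coefficients and the particular shifts and scale factors are assumptions made for the illustration.

```python
import math

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) using b_yx = mu11 / var_x and b_xy = mu11 / var_y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return mu11 / var_x, mu11 / var_y

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b_yx, b_xy = regression_coefficients(xs, ys)

# (a) r is the geometric mean of the two coefficients, with their common sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# (c) the arithmetic mean is not less than r when r > 0.
arithmetic_mean = (b_yx + b_xy) / 2

# (d) a change of origin leaves the coefficients unchanged ...
shifted = regression_coefficients([x - 10 for x in xs], [y + 3 for y in ys])

# ... but a change of scale does not: with u = x/2 and v = y/5,
# the coefficient of v on u becomes (2/5) * b_yx.
scaled = regression_coefficients([x / 2 for x in xs], [y / 5 for y in ys])
print(r, arithmetic_mean, shifted, scaled)
```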

 

8.8 CORRELATION RATIO:

As discussed earlier, when variables are linearly related, we have the regression lines of one variable on another and the correlation coefficient can be computed to tell us about the extent of association between them. However, if the variables are not linearly related but some sort of curvilinear relationship exists between them, the use of r, which is a measure of the degree to which the relation approaches a straight line “law”, will be misleading. We might come across bivariate distributions where r is very low or even zero but the regression is strong, or even perfect. The correlation ratio ‘η’ is the appropriate measure of curvilinear relationship between the two variables. Just as r measures the concentration of points about the straight line of best fit, η measures the concentration of points about the curve of best fit. If the regression is linear, η = |r|; otherwise η > |r|.
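The text does not state a formula for η, so the Python sketch below assumes the usual definition of the correlation ratio of Y on X: η² is the variation of the array (group) means of Y about the general mean, weighted by array size, divided by the total variation of Y. The function names and the parabola data are hypothetical; they are chosen to show a case where r is zero while η is 1.

```python
import math

def correlation_ratio(groups):
    """eta of Y on X, where groups maps each X value (array) to its list of Y values."""
    all_y = [y for ys in groups.values() for y in ys]
    grand_mean = sum(all_y) / len(all_y)
    between = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in groups.values()
    )
    total = sum((y - grand_mean) ** 2 for y in all_y)
    return math.sqrt(between / total)

def pearson_r(xs, ys):
    """Ordinary correlation coefficient, for comparison."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data lying exactly on the parabola y = x^2.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
groups = {x: [y] for x, y in zip(xs, ys)}

eta = correlation_ratio(groups)
r = pearson_r(xs, ys)
print(eta, r)
```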

 

8.9 INTRA – CLASS CORRELATION:

Intra-class correlation means within-class correlation. It is distinguishable from product moment correlation inasmuch as here both the variables measure the same characteristic. Sometimes, especially in biological and agricultural studies, it is of interest to know how the members of a family or group are correlated among themselves with respect to some one of their common characteristics. For example, we may require the correlation between the heights of brothers of a family or between yields of plots of an experimental block. In such cases both the variables measure the same characteristic, e.g., height and height or weight and weight. There is nothing to distinguish one from the other so that one may be treated as the X-variable and the other as the Y-variable.

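Suppose we have A1, A2, …, An families with k1, k2, …, kn members, each of which may be represented as

$$ x_{i1}, \; x_{i2}, \; \dots, \; x_{i k_i} \qquad (i = 1, 2, \dots, n) $$

and let xij (i = 1, 2, …, n; j = 1, 2, …, ki) denote the measurement on the jth member in the ith family.

We shall have ki(ki − 1) pairs for the ith family or group, like (xij, xil), j ≠ l. There will be N = Σ ki(ki − 1) entries for all the n families or groups. The table is symmetrical about the principal diagonal. Such a table is called an intra-class correlation table and the correlation is called intra-class correlation.

In the bivariate table xi1 occurs (ki − 1) times, xi2 occurs (ki − 1) times, …, xiki occurs (ki − 1) times, i.e., from the ith family we have (ki − 1) Σj xij and hence for all the n families we have Σi [(ki − 1) Σj xij] as the marginal total, the table being symmetrical about the principal diagonal. Therefore

$$ \bar{x} = \bar{y} = \frac{1}{N} \sum_{i} \Big[ (k_i - 1) \sum_{j} x_{ij} \Big] $$

Similarly,

$$ \sigma_X^2 = \sigma_Y^2 = \frac{1}{N} \sum_{i} \Big[ (k_i - 1) \sum_{j} (x_{ij} - \bar{x})^2 \Big] $$

Further

$$ \operatorname{Cov}(X, Y) = \frac{1}{N} \sum_{i} \sum_{j} \sum_{l \neq j} (x_{ij} - \bar{x})(x_{il} - \bar{x}) = \frac{1}{N} \sum_{i} \Big[ \Big\{ \sum_{j} (x_{ij} - \bar{x}) \Big\}^2 - \sum_{j} (x_{ij} - \bar{x})^2 \Big] $$

If we write

$$ \bar{x}_i = \frac{1}{k_i} \sum_{j} x_{ij}, \qquad \text{then} \qquad \sum_{i} \Big\{ \sum_{j} (x_{ij} - \bar{x}) \Big\}^2 = \sum_{i} k_i^2 (\bar{x}_i - \bar{x})^2 $$

Therefore the intra-class correlation coefficient is given by

$$ r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i} k_i^2 (\bar{x}_i - \bar{x})^2 - \sum_{i} \sum_{j} (x_{ij} - \bar{x})^2}{\sum_{i} \big[ (k_i - 1) \sum_{j} (x_{ij} - \bar{x})^2 \big]} $$

If we put ki = k, i.e., if all families have equal numbers of members, then

$$ r = \frac{k^2 \sum_{i} (\bar{x}_i - \bar{x})^2 - \sum_{i} \sum_{j} (x_{ij} - \bar{x})^2}{(k - 1) \sum_{i} \sum_{j} (x_{ij} - \bar{x})^2} = \frac{n k^2 \sigma_m^2 - n k \sigma^2}{(k - 1) \, n k \, \sigma^2} $$

$$ r = \frac{1}{k - 1} \left[ \frac{k \sigma_m^2}{\sigma^2} - 1 \right] $$

where σ² denotes the variance of X and σm² the variance of the means of the families.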
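For families of equal size k the intra-class coefficient can be obtained in two ways: by forming all k(k − 1) ordered pairs within each family and computing an ordinary correlation, or from the closed form r = (kσm²/σ² − 1)/(k − 1). The Python sketch below compares the two on hypothetical data; the function names and the measurements are assumptions.

```python
def intraclass_by_pairs(families):
    """Correlation over all ordered pairs (x_ij, x_il), j != l, within each family."""
    pairs = [
        (fam[j], fam[l])
        for fam in families
        for j in range(len(fam))
        for l in range(len(fam))
        if j != l
    ]
    n_pairs = len(pairs)
    mean = sum(a for a, _ in pairs) / n_pairs           # same for both margins by symmetry
    var = sum((a - mean) ** 2 for a, _ in pairs) / n_pairs
    cov = sum((a - mean) * (b - mean) for a, b in pairs) / n_pairs
    return cov / var

def intraclass_equal_k(families):
    """Closed form (k * var_m / var - 1) / (k - 1) for families of equal size k."""
    n, k = len(families), len(families[0])
    values = [x for fam in families for x in fam]
    mean = sum(values) / (n * k)
    var = sum((x - mean) ** 2 for x in values) / (n * k)
    var_means = sum((sum(fam) / k - mean) ** 2 for fam in families) / n
    return (k * var_means / var - 1) / (k - 1)

# Hypothetical heights of three brothers in each of four families.
families = [[68, 70, 69], [64, 65, 63], [71, 73, 72], [66, 66, 68]]
print(intraclass_by_pairs(families), intraclass_equal_k(families))
```

The closed form avoids building the symmetric table altogether, which is why it is normally preferred for hand calculation.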

 

8.10 BIVARIATE NORMAL DISTRIBUTION:

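The bivariate normal distribution is a generalization of the normal distribution for a single variate. Let X and Y be two normally correlated variables with correlation coefficient ρ and E(X) = µ1, Var(X) = σ1²; E(Y) = µ2, Var(Y) = σ2². In deriving the bivariate normal distribution we make the following three assumptions.

(i) The regression of Y on X is linear. Since the mean of each array is on the line of regression Y = ρ(σ2/σ1)X, the mean or expected value of Y is ρ(σ2/σ1)X for different values of X.

(ii) The arrays are homoscedastic, i.e., the variance in each array is the same. The common variance of estimate of Y in each array is then given by σ2²(1 − ρ²), ρ being the correlation coefficient between the variables X and Y, and is independent of X.

(iii) The distribution of Y in different arrays is normal. Suppose that one of the variates, say X, is distributed normally with mean 0 and standard deviation σ1, so that the probability that a random value of X will fall in the small interval dx is

$$ g(x)\,dx = \frac{1}{\sigma_1 \sqrt{2\pi}} \exp\left( -\frac{x^2}{2\sigma_1^2} \right) dx $$

The probability that a value of Y, taken at random in an assigned vertical array, will fall in the interval dy is

$$ h(y \mid x)\,dy = \frac{1}{\sigma_2 \sqrt{2\pi (1 - \rho^2)}} \exp\left\{ -\frac{1}{2\sigma_2^2 (1 - \rho^2)} \left( y - \rho \frac{\sigma_2}{\sigma_1} x \right)^2 \right\} dy $$

The joint probability differential of X and Y is given by

$$ dP(x, y) = g(x)\,h(y \mid x)\,dx\,dy = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left[ -\frac{1}{2(1 - \rho^2)} \left\{ \frac{x^2}{\sigma_1^2} - \frac{2\rho x y}{\sigma_1 \sigma_2} + \frac{y^2}{\sigma_2^2} \right\} \right] dx\,dy $$

Shifting the origin to (µ1, µ2), we get

$$ f(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left[ -\frac{1}{2(1 - \rho^2)} \left\{ \left( \frac{x - \mu_1}{\sigma_1} \right)^2 - 2\rho \left( \frac{x - \mu_1}{\sigma_1} \right) \left( \frac{y - \mu_2}{\sigma_2} \right) + \left( \frac{y - \mu_2}{\sigma_2} \right)^2 \right\} \right] $$

where µ1, µ2, σ1, σ2 and ρ are the five parameters of the distribution.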

This is the density function of a bivariate normal distribution. The variables X and Y are said to be normally correlated and the surface z=f(x,y) is known as the normal correlation surface. The nature of the normal correlation surface is indicated in the above diagram.
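The construction of the joint density as the product g(x) h(y|x) can be checked numerically. In the Python sketch below, the array of Y at a given x is taken to be normal with mean µ2 + ρ(σ2/σ1)(x − µ1) and variance σ2²(1 − ρ²), and the product is compared with the closed-form bivariate density with parameters µ1, µ2, σ1, σ2, ρ. The function names and parameter values are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def bivariate_normal_pdf(x, y, mu1, mu2, s1, s2, rho):
    """Closed-form bivariate normal density with five parameters."""
    zx = (x - mu1) / s1
    zy = (y - mu2) / s2
    q = (zx * zx - 2 * rho * zx * zy + zy * zy) / (1 - rho * rho)
    return math.exp(-0.5 * q) / (2 * math.pi * s1 * s2 * math.sqrt(1 - rho * rho))

def product_form(x, y, mu1, mu2, s1, s2, rho):
    """g(x) * h(y | x): marginal of X times the normal array of Y at that x."""
    array_mean = mu2 + rho * (s2 / s1) * (x - mu1)
    array_sd = s2 * math.sqrt(1 - rho * rho)
    return normal_pdf(x, mu1, s1) * normal_pdf(y, array_mean, array_sd)

# Hypothetical parameter values and evaluation point.
params = dict(mu1=1.0, mu2=-2.0, s1=1.5, s2=0.8, rho=0.6)
print(bivariate_normal_pdf(0.5, -1.0, **params))
print(product_form(0.5, -1.0, **params))
```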

 

8.11 MULTIPLE AND PARTIAL CORRELATION:

When the values of one variable are associated with or influenced by another variable, e.g., the ages of husband and wife, the heights of father and son, the supply and demand of a commodity and so on, Karl Pearson's coefficient of correlation can be used as a measure of the linear relationship between them. But sometimes there is interrelation between many variables and the value of one variable may be influenced by many others, e.g., the yield of crop per acre, say (X1), depends upon the quality of seed (X2), fertility of soil (X3), fertilizer used (X4), irrigation facilities (X5), weather conditions (X6) and so on. Whenever we are interested in studying the joint effect of a group of variables upon a variable not included in that group, our study is that of multiple correlation and multiple regression.

Suppose in a trivariate or multivariate distribution we are interested in the relationship between two variables only. There are two alternatives, viz., (i) we consider only those members of the observed data in which the other variates have specified values, or (ii) we eliminate mathematically the effect of the other variates on the two variates. The first method has the disadvantage that it limits the size of the data and also that it is applicable only to data in which the other variates have assigned values. In the second method it may not be possible to eliminate the entire influence of the other variates, but the linear effect can be easily eliminated. The correlation and regression between only two variates, eliminating the linear effect of the other variates on them, are called partial correlation and partial regression.

 

8.12 PLANE OF REGRESSION:

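The equation of the plane of regression of X1 on X2 and X3, the variables being measured from their respective means, is

$$ X_1 = b_{12.3} X_2 + b_{13.2} X_3 $$

The constants b are determined by the principle of least squares, i.e., by minimizing the sum of the squares of the residuals, viz.,

$$ S = \sum X_{1.23}^2 = \sum (X_1 - b_{12.3} X_2 - b_{13.2} X_3)^2 $$

the summation being extended to the given values (N in number) of the variables.

The normal equations for estimating b12.3 and b13.2 are

$$ \sum X_2 (X_1 - b_{12.3} X_2 - b_{13.2} X_3) = 0 \quad \text{and} \quad \sum X_3 (X_1 - b_{12.3} X_2 - b_{13.2} X_3) = 0 $$

Since the Xi are measured from their respective means, we have

$$ \sigma_i^2 = \frac{1}{N} \sum X_i^2, \qquad \operatorname{Cov}(X_i, X_j) = \frac{1}{N} \sum X_i X_j, $$

$$ r_{ij} = \frac{\operatorname{Cov}(X_i, X_j)}{\sigma_i \sigma_j} = \frac{\sum X_i X_j}{N \sigma_i \sigma_j} $$

Hence we get

$$ r_{12} \sigma_1 \sigma_2 - b_{12.3} \sigma_2^2 - b_{13.2} \, r_{23} \sigma_2 \sigma_3 = 0, \qquad r_{13} \sigma_1 \sigma_3 - b_{12.3} \, r_{23} \sigma_2 \sigma_3 - b_{13.2} \sigma_3^2 = 0 $$

Solving these equations for b12.3 and b13.2, we get

$$ b_{12.3} = \frac{\sigma_1}{\sigma_2} \cdot \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2} $$

Similarly, we will get

$$ b_{13.2} = \frac{\sigma_1}{\sigma_3} \cdot \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} $$

If we write

$$ \omega = \begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix} $$

and ωij is the cofactor of the element in the ith row and jth column of ω, then

$$ b_{12.3} = -\frac{\sigma_1}{\sigma_2} \cdot \frac{\omega_{12}}{\omega_{11}}, \qquad b_{13.2} = -\frac{\sigma_1}{\sigma_3} \cdot \frac{\omega_{13}}{\omega_{11}} $$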
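As a check on the algebra, the Python sketch below estimates the coefficients of the plane of regression of X1 on X2 and X3 in two ways: by solving the two normal equations directly, and from the standard correlation form b12.3 = (σ1/σ2)(r12 − r13 r23)/(1 − r23²), with the analogous expression for b13.2. The function names and data are hypothetical; the data are built so that X1 = 2 X2 − 3 X3 + 1 exactly, so both methods should return 2 and −3.

```python
import math

def center(values):
    """Measure a variable from its mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def plane_by_normal_equations(x1, x2, x3):
    """Solve the two normal equations for (b12.3, b13.2) by Cramer's rule."""
    a, b, c = center(x1), center(x2), center(x3)
    det = dot(b, b) * dot(c, c) - dot(b, c) ** 2
    b12_3 = (dot(a, b) * dot(c, c) - dot(a, c) * dot(b, c)) / det
    b13_2 = (dot(a, c) * dot(b, b) - dot(a, b) * dot(b, c)) / det
    return b12_3, b13_2

def plane_by_correlations(x1, x2, x3):
    """Same coefficients from the standard deviations and total correlations."""
    a, b, c = center(x1), center(x2), center(x3)
    n = len(a)
    s1, s2, s3 = (math.sqrt(dot(v, v) / n) for v in (a, b, c))
    r12 = dot(a, b) / (n * s1 * s2)
    r13 = dot(a, c) / (n * s1 * s3)
    r23 = dot(b, c) / (n * s2 * s3)
    b12_3 = (s1 / s2) * (r12 - r13 * r23) / (1 - r23 ** 2)
    b13_2 = (s1 / s3) * (r13 - r12 * r23) / (1 - r23 ** 2)
    return b12_3, b13_2

# Hypothetical data with x1 = 2*x2 - 3*x3 + 1 exactly.
x2 = [1, 2, 3, 4, 5]
x3 = [2, 1, 4, 3, 6]
x1 = [2 * p - 3 * q + 1 for p, q in zip(x2, x3)]
print(plane_by_normal_equations(x1, x2, x3))
print(plane_by_correlations(x1, x2, x3))
```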

 

8.13 PROPERTIES OF RESIDUALS:

Property 1:

The sum of the product of any residual of order zero with any other residual of higher order is zero, provided the subscript of the former occurs among the secondary subscripts of the latter.

 

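The normal equations for estimating the b's in trivariate and n-variate distributions are

$$ \sum X_2 X_{1.23} = 0, \quad \sum X_3 X_{1.23} = 0 \qquad \text{and} \qquad \sum X_i X_{1.23 \dots n} = 0, \quad i = 2, 3, \dots, n $$

respectively. Here Xi (i = 1, 2, 3, …, n) can be regarded as a residual of order zero. Hence the result.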

Property 2:

The sum of the product of any two residuals in which all the secondary subscripts of the first occur among the secondary subscripts of the second is unaltered if we omit any or all of the secondary subscripts of the first. Conversely, the product sum of any residual of order ‘p’ with a residual of order p+q , the ‘p’ subscripts being the same in each case is unaltered by adding to the secondary subscripts of the former any or all the ‘q’ additional subscripts of the latter

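Let us consider

$$ \sum X_{1.2} X_{1.23} = \sum (X_1 - b_{12} X_2) X_{1.23} = \sum X_1 X_{1.23} - b_{12} \sum X_2 X_{1.23} = \sum X_1 X_{1.23} $$

by Property 1. Again

$$ \sum X_{1.23}^2 = \sum X_{1.23} (X_1 - b_{12.3} X_2 - b_{13.2} X_3) = \sum X_1 X_{1.23} $$

so that Σ X1.23² = Σ X1.2 X1.23 = Σ X1 X1.23. Hence Property 2.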

Property 3:

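The sum of the products of two residuals is zero if all the subscripts of the one occur among the secondary subscripts of the other, e.g.,

$$ \sum X_{1.2} X_{3.12} = \sum (X_1 - b_{12} X_2) X_{3.12} = \sum X_1 X_{3.12} - b_{12} \sum X_2 X_{3.12} = 0 $$

by Property 1. Hence Property 3.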
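The three properties can be verified numerically on any data set. In the Python sketch below, residual_one removes the linear effect of one variable and residual_two removes the linear effect of two variables by least squares; these helper names and the data are assumptions for the illustration, and all variables are first measured from their means.

```python
def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual_one(y, a):
    """Residual of y after its least-squares regression on a (e.g. X1.2)."""
    slope = dot(y, a) / dot(a, a)
    return [yi - slope * ai for yi, ai in zip(y, a)]

def residual_two(y, a, b):
    """Residual of y after its least-squares regression on a and b (e.g. X1.23)."""
    det = dot(a, a) * dot(b, b) - dot(a, b) ** 2
    coef_a = (dot(y, a) * dot(b, b) - dot(y, b) * dot(a, b)) / det
    coef_b = (dot(y, b) * dot(a, a) - dot(y, a) * dot(a, b)) / det
    return [yi - coef_a * ai - coef_b * bi for yi, ai, bi in zip(y, a, b)]

# Hypothetical observations on three variables.
x1 = center([3, 5, 4, 8, 10])
x2 = center([1, 2, 3, 4, 5])
x3 = center([2, 1, 4, 3, 6])

x1_2 = residual_one(x1, x2)         # X1.2
x1_23 = residual_two(x1, x2, x3)    # X1.23
x3_12 = residual_two(x3, x1, x2)    # X3.12

print(dot(x2, x1_23), dot(x3, x1_23))                      # Property 1: both near zero
print(dot(x1_2, x1_23), dot(x1, x1_23), dot(x1_23, x1_23))  # Property 2: all equal
print(dot(x1_2, x3_12))                                     # Property 3: near zero
```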

 

8.14. COEFFICIENT OF MULTIPLE CORRELATION:

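In a trivariate distribution in which each of the variables X1, X2 and X3 has N observations, the multiple correlation coefficient of X1 on X2 and X3, usually denoted by R1.23, is the simple correlation coefficient between X1 and the joint effect of X2 and X3 on X1. In other words, R1.23 is the correlation coefficient between X1 and its estimated value as given by the plane of regression of X1 on X2 and X3, viz.,

$$ e_{1.23} = b_{12.3} X_2 + b_{13.2} X_3, \qquad X_{1.23} = X_1 - b_{12.3} X_2 - b_{13.2} X_3 = X_1 - e_{1.23} $$

Since the Xi are measured from their respective means, we have

E(X1.23) = 0 and E(e1.23) = 0

By definition,

$$ R_{1.23} = \frac{\operatorname{Cov}(X_1, e_{1.23})}{\sqrt{V(X_1) \, V(e_{1.23})}} $$

$$ \operatorname{Cov}(X_1, e_{1.23}) = \frac{1}{N} \sum X_1 e_{1.23} = \frac{1}{N} \sum X_1 (X_1 - X_{1.23}) $$

$$ = \frac{1}{N} \sum X_1^2 - \frac{1}{N} \sum X_1 X_{1.23} = \frac{1}{N} \sum X_1^2 - \frac{1}{N} \sum X_{1.23}^2 $$

$$ = \sigma_1^2 - \sigma_{1.23}^2 $$

Also

$$ V(e_{1.23}) = \frac{1}{N} \sum e_{1.23}^2 = \frac{1}{N} \sum (X_1 - X_{1.23})^2 $$

$$ = \frac{1}{N} \Big( \sum X_1^2 + \sum X_{1.23}^2 - 2 \sum X_1 X_{1.23} \Big) $$

$$ = \sigma_1^2 + \sigma_{1.23}^2 - 2 \sigma_{1.23}^2 = \sigma_1^2 - \sigma_{1.23}^2 $$

Therefore

$$ R_{1.23} = \frac{\sigma_1^2 - \sigma_{1.23}^2}{\sqrt{\sigma_1^2 (\sigma_1^2 - \sigma_{1.23}^2)}}, \qquad R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2} $$

Using σ1.23² = σ1² ω/ω11, where ω = 1 − r12² − r13² − r23² + 2 r12 r13 r23 is the determinant of the correlation matrix and ω11 = 1 − r23² is the cofactor of its leading element, we get

$$ R_{1.23}^2 = 1 - \frac{\omega}{\omega_{11}} $$

$$ = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2} $$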

This formula expresses the multiple correlation coefficient in terms of the total correlation coefficients between the pairs of variables.
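The Python sketch below compares the two descriptions of R1.23 on hypothetical data: the expression in terms of total correlations, R²1.23 = (r12² + r13² − 2 r12 r13 r23)/(1 − r23²), and the direct correlation between X1 and its estimate from the fitted plane of regression. Function names and data are assumptions made for the illustration.

```python
import math

def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def corr(u, v):
    """Simple (total) correlation coefficient between two variables."""
    u, v = center(u), center(v)
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def multiple_r_from_totals(r12, r13, r23):
    """R1.23 from the three total correlation coefficients."""
    return math.sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))

def multiple_r_direct(x1, x2, x3):
    """Correlation between X1 and its estimate from the fitted plane of regression."""
    a, b, c = center(x1), center(x2), center(x3)
    det = dot(b, b) * dot(c, c) - dot(b, c) ** 2
    b12_3 = (dot(a, b) * dot(c, c) - dot(a, c) * dot(b, c)) / det
    b13_2 = (dot(a, c) * dot(b, b) - dot(a, b) * dot(b, c)) / det
    estimate = [b12_3 * p + b13_2 * q for p, q in zip(b, c)]
    return corr(a, estimate)

# Hypothetical observations.
x1 = [3, 5, 4, 8, 10]
x2 = [1, 2, 3, 4, 5]
x3 = [2, 1, 4, 3, 6]

r_formula = multiple_r_from_totals(corr(x1, x2), corr(x1, x3), corr(x2, x3))
r_direct = multiple_r_direct(x1, x2, x3)
print(r_formula, r_direct)
```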

 

8.14.1 Properties of Multiple Correlation coefficient.

1. The multiple correlation coefficient measures the closeness of the association between the observed values and the expected values of a variable obtained from the multiple linear regression of that variable on the other variables.

2. The multiple correlation coefficient between observed values and expected values, when the expected values are calculated from a linear relation of the variables determined by the method of least squares, is always greater than that obtained when the expected values are calculated from any other linear combination of the variables.

3. Since R1.23 is the simple correlation between X1 and e1.23, it must lie between -1 and 1. But as seen above, R1.23 is a non-negative quantity and we conclude that

0 ≤ R1.23 ≤ 1

4. If R1.23 = 1, then the association is perfect and all the regression residuals are zero, and as such σ²1.23 = 0. In this case, since X1 = e1.23, the predicted value of X1, the multiple linear regression equation of X1 on X2 and X3 may be said to be a perfect prediction formula.

5. If R1.23 = 0, then all total and partial correlations involving X1 are zero. So X1 is completely uncorrelated with all the other variables in this case and the multiple regression equation fails to throw any light on the value of X1 when X2 and X3 are known.

6. R1.23 is not less than the absolute value of any total correlation coefficient involving X1, i.e.,

$$ R_{1.23} \ge |r_{12}|, \qquad R_{1.23} \ge |r_{13}| $$

                               

 

8.15. COEFFICIENT OF PARTIAL CORRELATION:

Sometimes the correlation between two variables X1 and X2 may be partly due to the correlation of a third variable, X3, with both X1 and X2. In such a situation, one may want to know what the correlation between X1 and X2 would be if the effect of X3 on each of X1 and X2 were eliminated. This correlation is called the partial correlation, and the correlation coefficient between X1 and X2 after the linear effect of X3 on each of them has been eliminated is called the partial correlation coefficient.

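The residual X1.3 = X1 − b13 X3 may be regarded as that part of the variable X1 which remains after the linear effect of X3 has been eliminated. Similarly, the residual X2.3 = X2 − b23 X3 may be interpreted as the part of the variable X2 obtained after eliminating the linear effect of X3. Thus the partial correlation coefficient between X1 and X2, usually denoted by r12.3, is given by

$$ r_{12.3} = \frac{\operatorname{Cov}(X_{1.3}, X_{2.3})}{\sqrt{\operatorname{Var}(X_{1.3}) \operatorname{Var}(X_{2.3})}} $$

We have

$$ \operatorname{Cov}(X_{1.3}, X_{2.3}) = \frac{1}{N} \sum X_{1.3} X_{2.3} = \frac{1}{N} \sum X_1 X_{2.3} = \frac{1}{N} \sum X_1 (X_2 - b_{23} X_3) $$

$$ = r_{12} \sigma_1 \sigma_2 - r_{23} \frac{\sigma_2}{\sigma_3} \, r_{13} \sigma_1 \sigma_3 = \sigma_1 \sigma_2 (r_{12} - r_{13} r_{23}) $$

Also

$$ \operatorname{Var}(X_{1.3}) = \frac{1}{N} \sum X_{1.3}^2 = \frac{1}{N} \sum X_1 X_{1.3} $$

$$ = \frac{1}{N} \sum X_1 (X_1 - b_{13} X_3) $$

$$ = \sigma_1^2 - r_{13} \frac{\sigma_1}{\sigma_3} \, r_{13} \sigma_1 \sigma_3 $$

$$ = \sigma_1^2 (1 - r_{13}^2) $$

Similarly we shall get

$$ \operatorname{Var}(X_{2.3}) = \sigma_2^2 (1 - r_{23}^2) $$

Hence

$$ r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} $$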
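The definition of r12.3 as the correlation between the residuals X1.3 and X2.3 can be compared numerically with the standard closed form r12.3 = (r12 − r13 r23)/√((1 − r13²)(1 − r23²)). The Python sketch below does this on hypothetical data; the helper names are assumptions.

```python
import math

def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def corr(u, v):
    u, v = center(u), center(v)
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def residual_one(y, a):
    """Remove from y (centered) the linear effect of a (centered)."""
    slope = dot(y, a) / dot(a, a)
    return [yi - slope * ai for yi, ai in zip(y, a)]

def partial_r_from_residuals(x1, x2, x3):
    """Correlation between X1.3 and X2.3."""
    a, b, c = center(x1), center(x2), center(x3)
    return corr(residual_one(a, c), residual_one(b, c))

def partial_r_from_totals(r12, r13, r23):
    """Closed form in terms of the total correlation coefficients."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical observations.
x1 = [3, 5, 4, 8, 10]
x2 = [1, 2, 3, 4, 5]
x3 = [2, 1, 4, 3, 6]

by_residuals = partial_r_from_residuals(x1, x2, x3)
by_formula = partial_r_from_totals(corr(x1, x2), corr(x1, x3), corr(x2, x3))
print(by_residuals, by_formula)
```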

 

 

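8.16. MULTIPLE CORRELATION IN TERMS OF TOTAL AND PARTIAL CORRELATION:

$$ 1 - R_{1.23}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2) $$

Proof.

We have

$$ R_{1.23}^2 = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2} \;\Rightarrow\; 1 - R_{1.23}^2 = \frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2} $$

Also

$$ r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}} \;\Rightarrow\; 1 - r_{13.2}^2 = \frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}}{(1 - r_{12}^2)(1 - r_{23}^2)} $$

Multiplying the second result by (1 − r12²) gives the first. Hence the result.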
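A quick numerical check of the identity 1 − R²1.23 = (1 − r12²)(1 − r13.2²), written as a short Python sketch. The three correlation values are hypothetical, and the two helper functions simply encode the standard expressions for R²1.23 and r13.2 in terms of total correlations.

```python
import math

def multiple_r_squared(r12, r13, r23):
    """R^2 of X1 on X2 and X3 from the total correlations."""
    return (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)

def partial_r13_2(r12, r13, r23):
    """Partial correlation between X1 and X3 with X2 eliminated."""
    return (r13 - r12 * r23) / math.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))

# Hypothetical total correlations.
r12, r13, r23 = 0.5, 0.4, 0.3

left = 1 - multiple_r_squared(r12, r13, r23)
right = (1 - r12 ** 2) * (1 - partial_r13_2(r12, r13, r23) ** 2)
print(left, right)
```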

 

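8.17. EXPRESSION FOR REGRESSION COEFFICIENT IN TERMS OF REGRESSION COEFFICIENTS OF LOWER ORDER:

By the properties of residuals,

$$ \sum X_{2.3} X_{1.23} = 0 \;\Rightarrow\; \sum X_{2.3} (X_1 - b_{12.3} X_2 - b_{13.2} X_3) = 0 $$

$$ \sum X_1 X_{2.3} - b_{12.3} \sum X_2 X_{2.3} - b_{13.2} \sum X_3 X_{2.3} = 0 $$

$$ \sum X_1 (X_2 - b_{23} X_3) - b_{12.3} \sum X_2 (X_2 - b_{23} X_3) = 0 \qquad \Big( \text{since } \sum X_3 X_{2.3} = 0 \Big) $$

Dividing both sides by N, the total number of observations, we get

$$ r_{12} \sigma_1 \sigma_2 - b_{23} \, r_{13} \sigma_1 \sigma_3 - b_{12.3} (\sigma_2^2 - b_{23} \, r_{23} \sigma_2 \sigma_3) = 0 $$

$$ b_{12.3} = \frac{r_{12} \dfrac{\sigma_1}{\sigma_2} - b_{23} \, r_{13} \dfrac{\sigma_1 \sigma_3}{\sigma_2^2}}{1 - b_{23} \, r_{23} \dfrac{\sigma_3}{\sigma_2}} $$

In the case of two variables we have

$$ b_{ij} = r_{ij} \frac{\sigma_i}{\sigma_j}, \quad \text{so that} \quad b_{12} = r_{12} \frac{\sigma_1}{\sigma_2}, \quad b_{13} = r_{13} \frac{\sigma_1}{\sigma_3}, \quad b_{32} = r_{23} \frac{\sigma_3}{\sigma_2}, \quad b_{23} = r_{23} \frac{\sigma_2}{\sigma_3} $$

Hence we get

$$ b_{12.3} = \frac{b_{12} - b_{13} b_{32}}{1 - b_{23} b_{32}} $$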
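The relation b12.3 = (b12 − b13 b32)/(1 − b23 b32) can be checked against a direct least-squares solution. In the Python sketch below, simple_b computes a two-variable regression coefficient and the direct value comes from solving the two normal equations of the plane of regression; the function names and data are hypothetical.

```python
def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simple_b(y, x):
    """Two-variable regression coefficient of y on x (both centered)."""
    return dot(y, x) / dot(x, x)

def b12_3_lower_order(a, b, c):
    """b12.3 built from the simple coefficients b12, b13, b32 and b23."""
    b12, b13 = simple_b(a, b), simple_b(a, c)
    b32, b23 = simple_b(c, b), simple_b(b, c)
    return (b12 - b13 * b32) / (1 - b23 * b32)

def b12_3_direct(a, b, c):
    """b12.3 from the normal equations of the plane of X1 on X2 and X3."""
    det = dot(b, b) * dot(c, c) - dot(b, c) ** 2
    return (dot(a, b) * dot(c, c) - dot(a, c) * dot(b, c)) / det

# Hypothetical observations, measured from their means.
x1 = center([3, 5, 4, 8, 10])
x2 = center([1, 2, 3, 4, 5])
x3 = center([2, 1, 4, 3, 6])

print(b12_3_lower_order(x1, x2, x3), b12_3_direct(x1, x2, x3))
```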

                                               

 

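8.18. EXPRESSION FOR PARTIAL CORRELATION COEFFICIENT IN TERMS OF CORRELATION COEFFICIENT OF LOWER ORDER:

The partial correlation coefficient is the geometric mean of the two corresponding partial regression coefficients, i.e., r12.3² = b12.3 b21.3.

We get

$$ b_{12.3} = \frac{\sigma_1}{\sigma_2} \cdot \frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2} $$

Also

$$ b_{21.3} = \frac{\sigma_2}{\sigma_1} \cdot \frac{r_{12} - r_{13} r_{23}}{1 - r_{13}^2} $$

Hence we get

$$ r_{12.3}^2 = \frac{(r_{12} - r_{13} r_{23})^2}{(1 - r_{13}^2)(1 - r_{23}^2)}, \qquad r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} $$

the sign of r12.3 being the common sign of b12.3 and b21.3.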

 

 

 

 

 

 

