Sampling and Large Sample Tests

10.   SAMPLING AND LARGE SAMPLE TESTS

 

10.1 INTRODUCTION:

Before giving the notion of sampling we will first define population. In a statistical investigation the interest usually lies in the assessment of the general magnitude and the study of variation with respect to one or more characteristics relating to individuals belonging to a group. This group of individuals under study is called the population or universe. Thus, in statistics, a population is an aggregate of objects, animate or inanimate, under study. The population may be finite or infinite.

For most statistical investigations, complete enumeration of the population is impracticable. For example, if we want to have an idea of the average per capita income of the people of India, we will have to enumerate all the earning individuals in the country, which is a very difficult task.

If the population is infinite, complete enumeration is not possible. Also, if the units are destroyed in the course of inspection, 100% inspection, though possible, is not at all desirable. But even if the population is finite or the inspection is not destructive, 100% inspection is not resorted to because of a multiplicity of causes, viz., administrative and financial implications, time factor, etc., and we take the help of sampling.

A finite subset of statistical individuals in a population is called a sample and the number of individuals in a sample is called the sample size.

For the purpose of determining population characteristics, instead of enumerating the entire population, only the individuals in the sample are observed. Then the sample characteristics are utilized to approximately determine or estimate the population characteristics. For example, on examining a sample of a particular commodity we arrive at a decision to purchase or reject that commodity. The error involved in such approximation is known as sampling error and is inherent and unavoidable in any and every sampling scheme. But sampling results in considerable gains, especially in time and cost, not only in respect of making observations of characteristics but also in the subsequent handling of the data.

Sampling is quite often used in our day-to-day practical life. For example, in a shop we assess the quality of sugar, wheat or any other commodity by taking a handful of it from the bag and then decide whether to purchase it or not. A housewife normally tastes the cooked products to find out if they are properly cooked and contain the proper quantity of salt.

 

10.2 TYPES OF SAMPLING:

Some of the commonly known and frequently used types of sampling are:

(i)          Purposive sampling

(ii)        Random sampling

(iii)       Stratified sampling

(iv)      Systematic sampling

Below we will define these terms precisely, without entering into detailed discussion.

 

10.2.1       Purposive sampling:

Purposive sampling is one in which the sample units are selected with definite purpose in view. For example, if we want to give the picture that the standard of living has increased in the city of New Delhi, we may take individuals in the sample from rich and posh localities like Defence Colony, South Extension, Golf Links, Jor Bagh, Chanakyapuri, Greater Kailash etc. and ignore the localities where low income group and the middle class families live. This sampling suffers from the drawback of favouritism and nepotism and does not give a representative sample of the population.

10.2.2       Random sampling:

In this case the sample units are selected at random and the drawback of purposive sampling, viz., favouritism or subjective element, is completely overcome. A random sample is one in which each unit of population has an equal chance of being included in it.

Suppose we take a sample of size n from a finite population of size N. Then there are $\binom{N}{n}$ possible samples. A sampling technique in which each of the $\binom{N}{n}$ samples has an equal chance of being selected is known as random sampling, and the sample obtained by this technique is termed a random sample.

Proper care has to be taken to ensure that the selected sample is random. Human bias, which varies from individual to individual, is inherent in any sampling scheme administered by human beings. Fairly good random samples can be obtained by the use of Tippett's random number tables or by throwing a die, drawing a lottery, etc.

The simplest method, which is normally used, is the lottery system which is illustrated below by means of an example.

Suppose we want to select 'r' candidates out of n. We assign the numbers one to n, one number to each candidate, and write these numbers on n slips which are made as homogeneous as possible in shape, size, etc. These slips are then put in a bag and thoroughly shuffled, and then 'r' slips are drawn one by one. The 'r' candidates corresponding to the numbers on the slips drawn will constitute the random sample.
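The lottery method can be imitated on a computer. The following is only a minimal sketch: the function name `lottery_sample` and the optional `seed` argument are illustrative assumptions, and a pseudo-random generator stands in for the physical shuffling of slips.

```python
import random


def lottery_sample(n, r, seed=None):
    """Draw r distinct candidate numbers from 1..n, as in the lottery method."""
    if not 0 <= r <= n:
        raise ValueError("r must satisfy 0 <= r <= n")
    rng = random.Random(seed)        # private generator; a seed only makes runs repeatable
    slips = list(range(1, n + 1))    # one numbered slip per candidate
    rng.shuffle(slips)               # the slips are thoroughly shuffled
    return slips[:r]                 # r slips drawn one by one, without replacement


chosen = lottery_sample(n=50, r=5, seed=1)   # hypothetical: 5 candidates out of 50
```

Because the slips are not put back into the bag, no candidate can be selected twice.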

 

10.2.3       Simple sampling:

Simple sampling is random sampling in which each unit of the population has an equal chance, say p, of being included in the sample, and this probability is independent of the previous drawings. Thus a simple sample of size n from a population may be identified with a series of n independent trials with constant probability 'p' of success for each trial.

 

10.2.4       Stratified sampling:

Here the entire heterogeneous population is divided into a number of homogeneous groups, usually termed strata, which differ from one another but each of which is homogeneous within itself. Then units are sampled at random from each of these strata, the sample size in each stratum varying according to the relative importance of the stratum in the population. The sample, which is the aggregate of the sampled units of each of the strata, is termed a stratified sample, and the technique of drawing this sample is known as stratified sampling. Such a sample is by far the best and can safely be considered as representative of the population from which it has been drawn.

 

10.3 PARAMETER AND STATISTIC:

In order to avoid verbal confusion with the statistical constants of the population, viz., mean (μ), variance (σ²), etc., which are usually referred to as parameters, statistical measures computed from the sample observations alone, e.g., mean (x̄), variance (s²), etc., have been termed statistics by Professor R. A. Fisher.

In practice, parameter values are not known and the estimates based on the sample values are generally used. Thus a statistic, which may be regarded as an estimate of a parameter obtained from the sample, is a function of the sample values only. It may be pointed out that a statistic, as it is based on sample values and as there are multiple choices of the samples that can be drawn from a population, varies from sample to sample. The determination or the characterization of the variation that may be attributed to chance or fluctuations of sampling is one of the fundamental problems of sampling theory.

 

10.3.1       Sampling distribution of a statistic:

If we draw a sample of size n from a given finite population of size N, then the total number of possible samples is:

$$\binom{N}{n} = \frac{N!}{n!\,(N-n)!} = k \quad \text{(say)}$$

For each of these k samples we can compute some statistic t = t(x1, x2, …, xn), in particular the mean x̄, the variance s², etc., as given below:

Sample number    t      x̄      s²
1                t₁     x̄₁     s₁²
2                t₂     x̄₂     s₂²
3                t₃     x̄₃     s₃²
.                .      .      .
.                .      .      .
k                tₖ     x̄ₖ     sₖ²

 

 

The set of the values of the statistic so obtained, one for each sample, constitutes what is called the sampling distribution of the statistic. For example, the values t1, t2, …, tk determine the sampling distribution of the statistic t, and we can compute the various statistical constants like mean, variance, skewness, kurtosis, etc., for its distribution. For example, the mean and variance of the sampling distribution of the statistic t are given by:

$$\bar{t} = \frac{1}{k}\sum_{i=1}^{k} t_i, \qquad \mathrm{Var}(t) = \frac{1}{k}\sum_{i=1}^{k} (t_i - \bar{t})^2$$
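For a very small population the sampling distribution can be written out in full. The sketch below assumes a hypothetical population of five values and takes the sample mean as the statistic t; the function name is illustrative only.

```python
from itertools import combinations
from statistics import mean, pvariance


def sampling_distribution_of_mean(population, n):
    """Return the sample mean of every possible sample of size n (no repeats)."""
    return [mean(sample) for sample in combinations(population, n)]


population = [2, 4, 6, 8, 10]        # hypothetical population, N = 5
t_values = sampling_distribution_of_mean(population, n=2)

k = len(t_values)                    # number of possible samples, C(5, 2)
t_bar = mean(t_values)               # mean of the sampling distribution
var_t = pvariance(t_values)          # (1/k) * sum of (t_i - t_bar)^2
```

Here the mean of the sampling distribution coincides with the population mean, which is 6 for this population.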

 

10.4 TESTS OF SIGNIFICANCE:

A very important aspect of sampling theory is the study of the tests of significance, which enable us to decide, on the basis of the sample results, if

(i)          the deviation between the observed sample statistic and the hypothetical parameter value, or

(ii)        the deviation between two independent sample statistics

is significant or might be attributed to chance or the fluctuations of sampling.

Since, for large n, almost all the distributions, e.g., binomial, Poisson, negative binomial, hypergeometric, t, F, chi-square, can be approximated very closely by a normal probability curve, we use the normal test of significance for large samples. Some of the well-known tests of significance for studying such differences for small samples are the t-test, the F-test and Fisher's z-transformation.

 

10.5 NULL HYPOTHESIS:

The technique of randomization used for the selection of sample units makes the test of significance valid for us. For applying the test of significance we first set up a hypothesis, that is, a definite statement about the population parameter. Such a hypothesis, which is usually a hypothesis of no difference, is called the null hypothesis and is usually denoted by H0. According to Prof. R. A. Fisher, the null hypothesis is the hypothesis which is tested for possible rejection under the assumption that it is true.

For example, in case of a single statistic, H0 will be that the sample statistic does not differ significantly from the hypothetical parameter values and in the case of two statistics, H0 will be that the sample statistics do not differ significantly.

Having set up the null hypothesis we compute the probability p that the deviation between the observed sample statistic and the hypothetical parameter value might have occurred due to fluctuations of sampling. If the deviation comes out to be significant, null hypothesis is rejected at the particular level of significance adopted and if the deviation is not significant, null hypothesis may be retained at that level.

 

10.5.1       Alternative Hypothesis:

Any hypothesis which is complementary to the null hypothesis is called an alternative hypothesis, usually denoted by H1. For example, if we want to test the null hypothesis that the population has a specified mean μ0, that is H0: μ = μ0, then the alternative hypothesis could be

(i)          H1: μ ≠ μ0

(ii)        H1: μ > μ0

(iii)       H1: μ < μ0

The alternative hypothesis in (i) is known as a two tailed alternative and the alternatives in (ii) and (iii) are known as right tailed and left tailed alternatives respectively. The setting of the alternative hypothesis is very important since it enables us to decide whether we have to use a single tailed or a two tailed test.

 

 

10.6 ERRORS IN SAMPLING:

 The main objective in sampling theory is to draw valid inferences about the population parameters on the basis of the sample results. In practice we decide to accept or reject the lot after examining a sample from it. As such we are liable to commit the following two types of errors:

Type I Error: Reject H0 when it is true.

Type II Error: Accept H0 when it is wrong, that is accept H0 when H1 is true.

If we write

P { Reject H0 when it is true} = P{ Reject H0| H0} =α

and P { Accept H0 when it is wrong} = P{ accept H0| H1} =β

then α and β are called the sizes of type I error and type II error, respectively.

In practice, type I error amounts to rejecting a lot when it is good and type II error may be regarded as accepting the lot when it is bad.

Thus P { Reject a lot when it is good} = α

and P { Accept a lot when it is bad } = β

where α and β are referred to as Producer’s risk and consumer’s risk respectively.

 

10.7 CRITICAL REGION AND LEVEL OF SIGNIFICANCE:

A region in the sample space S which amounts to rejection of H0 is termed the critical region or region of rejection. If ω is the critical region and if t = t(x1, x2, …, xn) is the value of the statistic based on a random sample of size n, then

$$P(t \in \omega \mid H_0) = \alpha, \qquad P(t \in \bar{\omega} \mid H_1) = \beta$$

where $\bar{\omega}$, the complementary set of ω, is called the acceptance region.

We have $\omega \cup \bar{\omega} = S$ and $\omega \cap \bar{\omega} = \varnothing$.

The probability ‘α’ that a random value of the statistic t belongs to the critical region is known as the level of significance. In other words level of significance is the size of the type I error. The levels of significance usually employed in testing of hypothesis are 5 % and 1%. The level of significance is always fixed in advance before collecting the sample information.

 

10.7.1       One tailed and two tailed tests:

In any test, the critical region is represented by a portion of the area under the probability curve of the sampling distribution of the test statistic.

A test of any statistical hypothesis where the alternative hypothesis is one tailed is called a one tailed test. For example, a test for testing the mean of a population

H0: μ = μ0

against the alternative hypothesis

H1: μ > μ0 (right tailed) or H1: μ < μ0 (left tailed)

is a single tailed test. In the right tailed test (H1: μ > μ0), the critical region lies entirely in the right tail of the sampling distribution of x̄, while for the left tailed test (H1: μ < μ0), the critical region is entirely in the left tail of the distribution.

A test of statistical hypothesis where the alternative hypothesis is two tailed such as:

H0: μ = μ0, against the alternative hypothesis H1: μ ≠ μ0 is known as two tailed test and in such a case the critical region is given by the  portion of the area lying in both the tails of the probability curve of the test statistic.

In a particular problem, whether one tailed or two tailed test is to be applied depends entirely on the nature of the alternative hypothesis. If the alternative hypothesis is two tailed we apply two tailed test and if alternative hypothesis is one tailed, we apply one tailed test.

 

10.7.2       Critical values or significant values:

The value of test statistic which separates the critical region and the acceptance region is called the critical value or significant value. It depends upon:

(i)          The level of significance used, and

(ii)        The alternative hypothesis, whether it is two tailed or single tailed.

As has been pointed out earlier, for large samples, the standardized variable corresponding to the statistic t, viz.,

$$Z = \frac{t - E(t)}{\mathrm{S.E.}(t)} \sim N(0, 1) \qquad \text{(a)}$$

asymptotically as n → ∞. The value of Z given by (a) under the null hypothesis is known as the test statistic. The critical value of the test statistic at level of significance α for a two tailed test is given by Zα, where Zα is determined by the equation

P(|Z| > Zα) = α        (1)

That is, Zα is the value such that the total area of the critical region on both tails is α. Since the normal probability curve is a symmetrical curve, from (1) we get

P(Z > Zα) + P(Z < −Zα) = α

⇒  P(Z > Zα) + P(Z > Zα) = α

⇒  2P(Z > Zα) = α

⇒  P(Z > Zα) = α/2

That is, the area of each tail is α/2. Thus Zα is the value such that the area to the right of Zα is α/2 and the area to the left of −Zα is α/2.

In case of single tail alternative, the critical value Zα is determined so that total area to the right of it is α and for left tailed test the total area to the left of - Zα is α.

Thus the significant or critical value of Z for a single tailed test (left or right) at level of significance ‘α’ is same as the critical value of Z for a two tailed test at level of significance ‘2α’.
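The critical values described above can be computed rather than read from tables. This is a small sketch using `NormalDist` from the Python standard library; the function name `critical_value` and its `two_tailed` flag are illustrative choices, not part of the text.

```python
from statistics import NormalDist


def critical_value(alpha, two_tailed=True):
    """Critical value of Z at level of significance alpha (standard normal curve)."""
    std_normal = NormalDist()                       # mean 0, standard deviation 1
    tail_area = alpha / 2 if two_tailed else alpha  # area left in each rejection tail
    return std_normal.inv_cdf(1 - tail_area)        # point with tail_area to its right


two_tailed_5 = critical_value(0.05)                     # about 1.96
one_tailed_5 = critical_value(0.05, two_tailed=False)   # about 1.645
```

Calling `critical_value(0.05, two_tailed=False)` and `critical_value(0.10)` gives the same number, which is the relation between single tailed and two tailed critical values stated above.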

 

10.7.3       Procedure for testing of hypothesis:

We now summarise below the various steps in testing of a statistical hypothesis in a systematic manner.

1.          Null hypothesis: set up the null hypothesis H0.

2.          Alternative hypothesis: set up the alternative hypothesis H1. This will enable us to decide whether we have to use a single tailed test or two tailed test.

3.          Level of significance: choose the appropriate level of significance (α) depending on the reliability of the estimates and permissible risk. This is to be decided before sample is drawn, that is α is fixed in advance.

4.          Test statistic: compute the test statistic

$$Z = \frac{t - E(t)}{\mathrm{S.E.}(t)}$$

under the null hypothesis.

5.          Conclusion: we compare the value of Z computed in step 4 with the significant value zα at the given level of significance 'α'.

If |Z| < zα, that is if the calculated value of Z is less than zα we say it is not significant. By this we mean that the difference t – E(t) is just due to  fluctuations of sampling and the sample data do not provide us sufficient evidence against the null hypothesis which may therefore, be accepted.

If |Z| > zα, that is if the computed value of test statistic is greater than the critical or significant value, then we say that it is significant and the null hypothesis is rejected at level of significance α that is with confidence coefficient (1- α).
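Steps 4 and 5 of the procedure can be sketched in a few lines. The function below is only an illustration: its name, its arguments and the `tail` keyword are assumptions, and the caller is expected to supply E(t) and S.E.(t) under H0.

```python
from statistics import NormalDist


def large_sample_test(t, expected, std_error, alpha=0.05, tail="two"):
    """Compute Z = (t - E(t)) / S.E.(t) and compare it with the critical value.

    tail is "two", "right" or "left", matching the alternative hypothesis H1.
    Returns (z, critical value, True if H0 is rejected).
    """
    z = (t - expected) / std_error
    std_normal = NormalDist()
    if tail == "two":
        z_alpha = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) > z_alpha
    elif tail == "right":
        z_alpha = std_normal.inv_cdf(1 - alpha)
        reject = z > z_alpha
    elif tail == "left":
        z_alpha = std_normal.inv_cdf(1 - alpha)
        reject = z < -z_alpha
    else:
        raise ValueError('tail must be "two", "right" or "left"')
    return z, z_alpha, reject


# Hypothetical figures: observed statistic 52, E(t) = 50 and S.E.(t) = 1 under H0
z, z_alpha, reject = large_sample_test(52, 50, 1, alpha=0.05, tail="two")
```

The level of significance is passed in as an argument, in keeping with the rule that α is fixed before the sample is examined.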

 

10.8 TEST OF SIGNIFICANCE FOR LARGE SAMPLES:

In this section we will discuss the test of significance when samples are large. We have seen that for large values of n, the number of trials, almost all the distributions, eg., binomial, Poisson, Negative binomial, etc., are very closely approximated by normal distribution. Thus in this case we apply the normal test, which is based upon the following fundamental property of the normal probability curve.

If X ~ N(μ, σ²), then

$$Z = \frac{X - \mu}{\sigma} = \frac{X - E(X)}{\sqrt{V(X)}} \sim N(0, 1)$$

Thus from the normal probability tables, we have

P(−3 ≤ Z ≤ 3) = 0.9973, that is, P(|Z| ≤ 3) = 0.9973

⇒  P(|Z| > 3) = 1 − P(|Z| ≤ 3) = 0.0027

That is, in all probability we should expect a standard normal variate to lie between ±3.

Also from the normal probability tables, we get

P(−1.96 ≤ Z ≤ 1.96) = 0.95, that is, P(|Z| ≤ 1.96) = 0.95

⇒  P(|Z| > 1.96) = 1 − 0.95 = 0.05

Similarly, P(|Z| ≤ 2.58) = 0.99

⇒  P(|Z| > 2.58) = 0.01

Thus the significant values of Z at 5% and 1% level of significance  for a two tailed test are 1.96 and 2.58 respectively.

Thus the steps to be used in the normal test are as follows:

(i)          Compute the test statistic Z under H0.

(ii)        If |Z| >3, H0 is always rejected.

(iii)       If |Z| ≤3, we test its significance at certain level of significance, usually at 5% and sometimes at 1% level of significance. Thus, for a two tailed test if |Z| >1.96, H0 is rejected at 5% level of significance.

Similarly if |Z| > 2.58, H0 is contradicted at 1% level of significance and if |Z| ≤ 2.58, H0 may be accepted at 1% level of significance.

From the normal probability tables, we have:

P(Z >1.645) = 0.5 – P (0 ≤Z ≤1.645)

= 0.5 – 0.45

= 0.05

P(Z> 2.33) = 0.5 - P(0 ≤Z ≤2.33)

= 0.5 – 0.49

= 0.01

Hence for a single tailed test we compare the computed value of |Z| with 1.645 (at 5% level of significance) or 2.33 (at 1% level of significance) and accept or reject H0 accordingly.
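The areas quoted from the normal probability tables in this section can be checked numerically. A brief sketch, assuming only the standard library `NormalDist` class; the two helper names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal variate Z


def central_area(c):
    """P(-c <= Z <= c)."""
    return std_normal.cdf(c) - std_normal.cdf(-c)


def right_tail_area(c):
    """P(Z > c)."""
    return 1 - std_normal.cdf(c)


within_3 = central_area(3)               # table value 0.9973
within_1_96 = central_area(1.96)         # table value 0.95
beyond_1_645 = right_tail_area(1.645)    # table value 0.05
```

The computed areas agree with the table values to the number of decimal places quoted in the text.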

 

10.9 SAMPLING OF ATTRIBUTES:

Here we shall consider sampling from a population which is divided into two mutually exclusive and collectively exhaustive classes, one class possessing a particular attribute, say A, and the other class not possessing that attribute, and then note down the number of persons in the sample of size n possessing that attribute. The presence of the attribute in a sampled unit may be termed success and its absence failure. In this case a sample of n observations is identified with a series of n independent Bernoulli trials with constant probability P of success for each trial. Then the probability of x successes in n trials, as given by the binomial probability distribution, is

$$p(x) = \binom{n}{x} P^x Q^{n-x}, \qquad x = 0, 1, 2, \ldots, n; \quad Q = 1 - P$$

10.9.1       Test for single proportion:

If X is the number of successes in n independent trials with constant probability P of success for each trial, then

E(X) = nP and V(X) = nPQ

where Q = 1 − P is the probability of failure.

It has been proved that for large n, the binomial distribution tends to the normal distribution. Hence for large n, X ~ N(nP, nPQ), that is

$$Z = \frac{X - E(X)}{\sqrt{V(X)}} = \frac{X - nP}{\sqrt{nPQ}} \sim N(0, 1)$$

and we apply the normal test.
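As an illustration of the single proportion test, the sketch below standardizes an observed number of successes using E(X) = nP and V(X) = nPQ. The figures (225 successes in 400 trials, H0: P = 0.5) are hypothetical, and the function name is an assumption.

```python
from math import sqrt


def single_proportion_z(x, n, P):
    """Z = (X - nP) / sqrt(nPQ) for x successes in n trials under H0: proportion = P."""
    Q = 1 - P                        # probability of failure
    return (x - n * P) / sqrt(n * P * Q)


# Hypothetical data: 225 successes in 400 trials, H0: P = 0.5
z = single_proportion_z(225, 400, 0.5)
reject_at_5_percent = abs(z) > 1.96      # two tailed critical value at 5% level
reject_at_1_percent = abs(z) > 2.58      # two tailed critical value at 1% level
```

With these figures Z = 25/10 = 2.5, so H0 is rejected at the 5% level but not at the 1% level.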

 

10.9.2       Test of significance for difference of proportions:

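Suppose we want to compare two distinct populations with respect to the prevalence of a certain attribute, say A, among their members. Let X1, X2 be the numbers of persons possessing the given attribute A in random samples of sizes n1 and n2 from the two populations respectively. Then the sample proportions are given by

$$p_1 = \frac{X_1}{n_1} \quad \text{and} \quad p_2 = \frac{X_2}{n_2}$$

If P1 and P2 are the population proportions, then

E(p1) = P1, E(p2) = P2

$$V(p_1) = \frac{P_1 Q_1}{n_1} \quad \text{and} \quad V(p_2) = \frac{P_2 Q_2}{n_2}$$

where Q1 = 1 − P1 and Q2 = 1 − P2.

Since for large samples p1 and p2 are asymptotically normally distributed, (p1 − p2) is also normally distributed. Then the standard variable corresponding to the difference (p1 − p2) is given by

$$Z = \frac{(p_1 - p_2) - E(p_1 - p_2)}{\sqrt{V(p_1 - p_2)}} \sim N(0, 1)$$

Under the null hypothesis H0: P1 = P2 = P (say), we have E(p1 − p2) = 0 and V(p1 − p2) = PQ(1/n1 + 1/n2), where Q = 1 − P. Thus the test statistic in this case is

$$Z = \frac{p_1 - p_2}{\sqrt{PQ\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim N(0, 1)$$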
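A sketch of the test for the difference of proportions follows. One point is an assumption not stated in the text above: since the common population proportion P is generally unknown, the code replaces it by the pooled sample estimate (X1 + X2)/(n1 + n2). The function name and the data are hypothetical.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Z for H0: P1 = P2, with the common P replaced by the pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    P_hat = (x1 + x2) / (n1 + n2)    # pooled estimate of P (assumption, see lead-in)
    Q_hat = 1 - P_hat
    std_error = sqrt(P_hat * Q_hat * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_error


# Hypothetical data: 200 out of 400 in the first sample, 160 out of 400 in the second
z = two_proportion_z(200, 400, 160, 400)
reject_at_5_percent = abs(z) > 1.96      # two tailed test at 5% level
```

If the population proportion P is known from other sources, it should be used in place of the pooled estimate.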

 

10.10       SAMPLING OF VARIABLES:

In the case of sampling variables each member of the population provides the value of the variable and the aggregate of these values forms the frequency distribution of the population. From the population, a random sample of size n can be drawn by any of the sampling methods discussed before which is same as choosing n values of the given variables from the distribution.

 

10.11      UNBIASED ESTIMATE FOR POPULATION MEAN (μ) AND VARIANCE (σ²):

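Let x1, x2, …, xn be a random sample of size n from a large population X1, X2, …, XN with mean μ and variance σ². Then the sample mean (x̄) and variance (s²) are given by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{and} \quad s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Now

$$E(\bar{x}) = E\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i)$$

Since xi is a sample observation from the population Xi (i = 1, 2, …, N), it can take any one of the values X1, X2, …, XN, each with equal probability 1/N. Hence

$$E(x_i) = \frac{1}{N}(X_1 + X_2 + \cdots + X_N) = \mu \quad \Rightarrow \quad E(\bar{x}) = \frac{1}{n}(n\mu) = \mu$$

Thus the sample mean is an unbiased estimate of the population mean μ.

Now

$$E(s^2) = E\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2\right] = \frac{1}{n}\sum_{i=1}^{n} E(x_i^2) - E(\bar{x}^2) \qquad (1)$$

We have

$$V(x_i) = E[x_i - E(x_i)]^2 = E(x_i - \mu)^2 = \frac{1}{N}\sum_{j=1}^{N} (X_j - \mu)^2 = \sigma^2$$

Also we know that

$$V(x) = E(x^2) - [E(x)]^2 \quad \Rightarrow \quad E(x^2) = V(x) + [E(x)]^2$$

In particular,

$$E(x_i^2) = \sigma^2 + \mu^2 \qquad (2)$$

Also $E(\bar{x}^2) = V(\bar{x}) + [E(\bar{x})]^2$ and $V(\bar{x}) = \sigma^2/n$, where σ² is the population variance. Hence

$$E(\bar{x}^2) = \frac{\sigma^2}{n} + \mu^2 \qquad (3)$$

Substituting from (2) and (3) in (1) we get

$$E(s^2) = \frac{1}{n}\sum_{i=1}^{n} (\sigma^2 + \mu^2) - \left(\frac{\sigma^2}{n} + \mu^2\right) = \left(1 - \frac{1}{n}\right)\sigma^2 = \frac{n-1}{n}\,\sigma^2 \qquad (4)$$

Since E(s²) ≠ σ², the sample variance is not an unbiased estimate of the population variance.

From (4), we get

$$\frac{n}{n-1}\,E(s^2) = \sigma^2 \quad \Rightarrow \quad E\left(\frac{n s^2}{n-1}\right) = \sigma^2 \quad \Rightarrow \quad E(S^2) = \sigma^2$$

where

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Therefore S² is an unbiased estimate of the population variance σ².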
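The bias of the sample variance can be verified exactly on a tiny example. The sketch below assumes a hypothetical population of six values and independent draws, each value having probability 1/N at every draw, which is the setting used in the derivation. It averages the variance with divisor n and the variance with divisor n − 1 over all Nⁿ equally likely samples, using exact fractions.

```python
from fractions import Fraction
from itertools import product


def expected_variances(population, n):
    """Exact averages of s^2 (divisor n) and S^2 (divisor n - 1) over all samples."""
    samples = list(product(population, repeat=n))   # N**n equally likely samples
    total_s2 = Fraction(0)
    for sample in samples:
        x_bar = Fraction(sum(sample), n)
        total_s2 += sum((x - x_bar) ** 2 for x in sample) / n
    e_s2 = total_s2 / len(samples)
    e_S2 = e_s2 * n / (n - 1)       # S^2 = n * s^2 / (n - 1) for every sample
    return e_s2, e_S2


population = [1, 2, 3, 4, 5, 6]     # hypothetical population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

e_s2, e_S2 = expected_variances(population, n=3)
```

For this population σ² = 35/12. The average of the divisor-n variance comes out as (2/3)σ² = 35/18, while the average of the divisor-(n − 1) variance equals σ².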

 

10.12 STANDARD ERROR OF SAMPLE MEAN:

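The variance of the sample mean is σ²/n, where σ is the population standard deviation and n is the size of the random sample.

The standard error of the mean of a random sample of size n from a population with variance σ² is σ/√n.

Proof:

Let x1, x2, …, xn be a random sample of size n from a population with variance σ². Then the sample mean is given by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$V(\bar{x}) = V\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\,V(x_1 + x_2 + \cdots + x_n) = \frac{1}{n^2}\left[V(x_1) + V(x_2) + \cdots + V(x_n)\right]$$

The covariance terms vanish since the sample observations are independent.

But V(xi) = σ² (i = 1, 2, …, n), so

$$V(\bar{x}) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}$$

Standard error:

$$\mathrm{S.E.}(\bar{x}) = \sqrt{V(\bar{x})} = \frac{\sigma}{\sqrt{n}}$$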

 

10.13       TEST OF SIGNIFICANCE FOR SINGLE MEAN:


We have proved that if xi (i = 1, 2, …, n) is a random sample of size n from a normal population with mean μ and variance σ², then the sample mean x̄ is distributed normally with mean μ and variance σ²/n, that is, x̄ ~ N(μ, σ²/n). However, this result holds, that is, x̄ ~ N(μ, σ²/n), even in random sampling from a non-normal population, provided the sample size n is large, by the central limit theorem.

Thus for large samples, the standard normal variate corresponding to x̄ is:

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

Under the null hypothesis H0 that the sample has been drawn from a population with mean μ and variance σ², that is, there is no significant difference between the sample mean (x̄) and the population mean (μ), the test statistic is:

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
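The test for a single mean reduces to one line of arithmetic. A minimal sketch with hypothetical figures (sample mean 52 from n = 100 observations, hypothesized mean 50, population standard deviation 10); the function name is illustrative.

```python
from math import sqrt


def single_mean_z(x_bar, mu, sigma, n):
    """Z = (x_bar - mu) / (sigma / sqrt(n)) for a large sample of size n."""
    return (x_bar - mu) / (sigma / sqrt(n))


# Hypothetical data: x_bar = 52, H0: mu = 50, sigma = 10, n = 100
z = single_mean_z(x_bar=52.0, mu=50.0, sigma=10.0, n=100)
reject_at_5_percent = abs(z) > 1.96      # two tailed test at 5% level
reject_at_1_percent = abs(z) > 2.58      # two tailed test at 1% level
```

Here the standard error is 10/√100 = 1, so Z = 2, which is significant at the 5% level but not at the 1% level.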

 

10.14 TEST OF SIGNIFICANCE FOR DIFFERENCE OF MEANS:

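Let x̄1 be the mean of a random sample of size n1 from a population with mean μ1 and variance σ1², and let x̄2 be the mean of an independent random sample of size n2 from another population with mean μ2 and variance σ2². Then, since the sample sizes are large,

$$\bar{x}_1 \sim N\left(\mu_1, \frac{\sigma_1^2}{n_1}\right) \quad \text{and} \quad \bar{x}_2 \sim N\left(\mu_2, \frac{\sigma_2^2}{n_2}\right)$$

Also x̄1 − x̄2, being the difference of two independent normal variates, is also a normal variate. The standard normal variable Z corresponding to x̄1 − x̄2 is given by

$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - E(\bar{x}_1 - \bar{x}_2)}{\mathrm{S.E.}(\bar{x}_1 - \bar{x}_2)} \sim N(0, 1)$$

Under the null hypothesis H0: μ1 = μ2, that is, there is no significant difference between the sample means, we get

$$E(\bar{x}_1 - \bar{x}_2) = E(\bar{x}_1) - E(\bar{x}_2) = \mu_1 - \mu_2 = 0$$

$$V(\bar{x}_1 - \bar{x}_2) = V(\bar{x}_1) + V(\bar{x}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$

The covariance term vanishes, since the sample means x̄1 and x̄2 are independent.

Thus under H0: μ1 = μ2, the test statistic becomes

$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1)$$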
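The test for the difference of two means can be sketched as follows. The function name and the data are hypothetical, and the population variances are taken as known. For large samples they are commonly replaced by the sample variances, which is an assumption beyond what is stated here.

```python
from math import sqrt


def two_mean_z(x_bar1, var1, n1, x_bar2, var2, n2):
    """Z for H0: mu1 = mu2, given the two sample means and population variances."""
    std_error = sqrt(var1 / n1 + var2 / n2)   # S.E. of the difference of means
    return (x_bar1 - x_bar2) / std_error


# Hypothetical data: means 52 and 50, variances 100 and 300, 100 observations each
z = two_mean_z(52.0, 100.0, 100, 50.0, 300.0, 100)
reject_at_5_percent = abs(z) > 1.96      # two tailed test at 5% level
```

With these figures the standard error is √(1 + 3) = 2 and Z = 1, so the difference between the sample means is not significant at the 5% level.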

 

10.15 TEST OF SIGNIFICANCE FOR THE DIFFERENCE OF STANDARD DEVIATIONS:

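If s1 and s2 are the standard deviations of two independent samples, then under the null hypothesis H0: σ1 = σ2, that is, the sample standard deviations do not differ significantly, the statistic

$$Z = \frac{s_1 - s_2}{\mathrm{S.E.}(s_1 - s_2)} \sim N(0, 1)$$

for large samples.

But in the case of large samples, the standard error (S.E.) of the difference of the sample standard deviations is given by

$$\mathrm{S.E.}(s_1 - s_2) = \sqrt{\frac{\sigma_1^2}{2n_1} + \frac{\sigma_2^2}{2n_2}}$$

σ1² and σ2² are usually unknown, and for large samples we use their estimates given by the corresponding sample variances. Hence the test statistic reduces to

$$Z = \frac{s_1 - s_2}{\sqrt{\dfrac{s_1^2}{2n_1} + \dfrac{s_2^2}{2n_2}}} \sim N(0, 1)$$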
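A short sketch of the test for the difference of standard deviations, with the unknown population variances replaced by the sample variances as described. The function name and the sample figures are hypothetical.

```python
from math import sqrt


def sd_difference_z(s1, n1, s2, n2):
    """Z for H0: sigma1 = sigma2, using sample variances in the standard error."""
    std_error = sqrt(s1 ** 2 / (2 * n1) + s2 ** 2 / (2 * n2))
    return (s1 - s2) / std_error


# Hypothetical data: s1 = 10 from 50 observations, s2 = 8 from 32 observations
z = sd_difference_z(10.0, 50, 8.0, 32)
reject_at_5_percent = abs(z) > 1.96      # two tailed test at 5% level
```

Here each term under the square root equals 1, so Z = 2/√2, roughly 1.41, and H0 is not rejected at the 5% level.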

 

 

 

 

 

