Data can be defined as any quantitative or qualitative values of a variable. Data involves facts and statistics collected together for analysis.
There are two sources of data collection:
The primary data refers to the original data that is collected by the researcher for the first time and for a specific purpose. Following are some of the sources through which primary data is collected:
Primary research done to collect primary data is a time consuming and expensive process.
Secondary data is that data which is collected by someone other than the user. It is that type of data that has already been collected and is readily available through different sources.
Random sampling is a sampling technique in which each member of a population has an equal probability of being chosen for the sample.
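As a minimal sketch of simple random sampling, the snippet below draws 10 units from a hypothetical population of 100 numbered units using Python's standard `random` module. The population, the sample size and the seed are illustrative assumptions, not values from this text.

```python
import random

# Hypothetical population of 100 numbered units (an assumption for illustration).
population = list(range(1, 101))

# A fixed seed only makes the illustration repeatable.
rng = random.Random(42)

# Draw 10 distinct units without replacement; each unit is equally likely to be picked.
sample = rng.sample(population, 10)
print(sample)
```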
Following are different types of random sampling-
In statistics, a frequency distribution is a tabular or graphical form that displays the frequencies of various outcomes in a sample.
In statistics, Univariate (“Uni” means “one” and “variate” means “variable”) is a commonly used term that describes a data which consists of observations on only a single attribute or characteristic (variable). In other words, a univariate distribution has only one variable.
Thus, when data is classified (summarized or grouped in the form of a frequency distribution) on the basis of a single variable, the distribution so formed is known as univariate frequency distribution.
Consider the following example of a univariate frequency distribution-
The following data shows the marks (variable) obtained in a test (out of 100) by 60 students of a class-
Let the marks of students be denoted by variable X and number of students by f (frequency).
Marks (X) | No. of students (f) |
---|---|
20 | 12 |
30 | 8 |
40 | 10 |
50 | 20 |
60 | 4 |
70 | 6 |
Total | N=∑f=60 |
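The table above can be reproduced with a short sketch. It assumes the raw marks are available as a flat list (rebuilt here from the table, since the raw data is not given) and counts them with `collections.Counter`.

```python
from collections import Counter

# Frequencies copied from the table above.
table = {20: 12, 30: 8, 40: 10, 50: 20, 60: 4, 70: 6}

# Rebuild a raw list of marks: each mark is repeated f times.
raw_marks = [mark for mark, f in table.items() for _ in range(f)]

# Counting the raw observations recovers the univariate frequency distribution.
frequency = Counter(raw_marks)
N = sum(frequency.values())

for mark in sorted(frequency):
    print(mark, frequency[mark])
print("N =", N)  # N = 60
```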
Central tendency or an average is a middle or central point of a probability distribution.
According to Prof Bowley “Measures of central tendency (averages) are statistical constants which enable us to comprehend in a single effort the significance of the whole”.
Central tendency helps in summarizing the data in a single value and thus enabling comparisons between data.
A measure of central tendency, or an average, attempts to describe a set of data with a single value that represents the middle or center of the distribution. Measures of central tendency include the arithmetic mean, geometric mean, harmonic mean, mode and median.
Arithmetic mean (or simply mean or average) is the most commonly used method of representing the entire data set by one value. It is the most popular measure of central tendency as it is easily understandable.
Sample mean (X̅) refers to the mean of a statistical sample, while population mean (μ) is the mean of the whole population under study. The sample mean provides an estimate of the population mean. The formulae for calculating the population mean and the sample mean are the same. Here, we analyze the concept of arithmetic mean as a sample mean.
Arithmetic mean can be of two types: simple arithmetic mean and weighted arithmetic mean.
The mean is defined as the sum of numerical values of the observations in a data set divided by the total number of observations in that data set.
Sample mean or Arithmetic mean (here, we are analyzing Arithmetic mean as sample mean) is denoted by X̅.
A. INDIVIDUAL SERIES
Direct method
Suppose we have a data set containing n observations with values b1, b2, b3, …, bn. In this case Arithmetic mean will be:
X̅= (b1+b2+b3+⋯+bn)/n
Or, X̅=∑b/n
Short cut method
Arithmetic mean in this method is calculated by using an arbitrary origin (A). This arbitrary origin is also known as the assumed mean. The deviations of the values from the assumed mean are taken and then the following formula is used to find the mean-
X̅=A+(∑d)/N
A=Assumed mean (arbitrary origin)
d= deviation of values from assumed mean=(X-A)
N=Number of observations
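The direct and short-cut methods for an individual series can be compared in a small sketch. The five observations and the assumed mean A = 15 are hypothetical values chosen for illustration.

```python
# Hypothetical observations (not from the text).
values = [12, 15, 18, 20, 25]
n = len(values)

# Direct method: sum of the observations divided by their number.
mean_direct = sum(values) / n

# Short-cut method: take deviations d = X - A from an assumed mean A.
A = 15
deviations = [x - A for x in values]
mean_shortcut = A + sum(deviations) / n

print(mean_direct, mean_shortcut)  # 18.0 18.0
```

Any choice of A gives the same mean; a value near the middle of the data simply keeps the deviations small.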
Arithmetic mean in case of ungrouped Frequency distribution
Direct method
Formula for mean calculation under direct method is as follows-
X̅=∑fX/N
f= Frequency
X= the value of variable in question
N=∑f= Total no of observations
Short-cut method
The formula used in this method is-
X̅=A+∑fd/N
A=Assumed mean (arbitrary origin)
d=(X-A)
N=∑f= Total no of observations
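Applied to the marks table given earlier (60 students), both formulas can be checked with the sketch below. The assumed mean A = 40 is an arbitrary choice made for illustration.

```python
# Marks (X) and frequencies (f) from the univariate frequency distribution.
marks = [20, 30, 40, 50, 60, 70]
freq = [12, 8, 10, 20, 4, 6]
N = sum(freq)  # 60

# Direct method: mean = sum(f * X) / N
mean_direct = sum(f * x for x, f in zip(marks, freq)) / N

# Short-cut method: mean = A + sum(f * d) / N, with d = X - A
A = 40
mean_shortcut = A + sum(f * (x - A) for x, f in zip(marks, freq)) / N

print(round(mean_direct, 2), round(mean_shortcut, 2))  # 42.33 42.33
```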
B. CONTINUOUS SERIES (grouped frequency distribution)
Grouped frequency distribution is the organizing of the raw data using classes and frequencies.
In a continuous series, instead of individual values like 10, 20, 30, …n, we have data in the form of class intervals such as 10-20, 20-30, 30-40….so on.
Direct method
Under this method the following formula is used to calculate the mean-
X̅=∑fm/N
m= mid-point of various classes given in question
f= frequency of each class
N=∑f= Total no of observations
Short-cut method
This method suggests the use of following formula to calculate arithmetic mean-
X̅=A+(∑fd)/N
A=Assumed mean (arbitrary origin)
d=(m-A) = deviation of mid-points from assumed mean
N=∑f= Total no of observations
When the deviations are also divided by the class interval (the step deviation method), the following formula is used- X̅=A+(∑fd)/N×h
A=Assumed mean (arbitrary origin)
d=(m-A)/h = deviation of mid-points from assumed mean, measured in units of the class interval
N=∑f= Total no of observations
h = Class interval (example- when we have a class 30-40, the class interval here is 10)
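The following sketch computes the mean of a grouped frequency distribution with the direct method, the short-cut method, and the variant that divides deviations by the class interval h. The classes, frequencies and assumed mean are hypothetical.

```python
# Hypothetical grouped data: (lower limit, upper limit) and class frequencies.
classes = [(10, 20), (20, 30), (30, 40), (40, 50)]
freq = [5, 10, 15, 10]
N = sum(freq)

# Mid-point m of each class.
mid = [(lo + hi) / 2 for lo, hi in classes]

# Direct method: mean = sum(f * m) / N
mean_direct = sum(f * m for m, f in zip(mid, freq)) / N

# Short-cut method: d = m - A
A = 35
mean_shortcut = A + sum(f * (m - A) for m, f in zip(mid, freq)) / N

# Deviations in units of the class interval: d = (m - A) / h, then scale back by h.
h = 10
mean_step = A + sum(f * (m - A) / h for m, f in zip(mid, freq)) / N * h

print(mean_direct, mean_shortcut, mean_step)  # 32.5 32.5 32.5
```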
Simple arithmetic mean gives equal importance to all the items in a data set. However, there may be some items whose relative importance in the data is not same. The weighted arithmetic mean helps in calculation of mean by assigning weights to each item. The term weight here stands for the relative importance of different items of the data.
The formula required to calculate the weighted arithmetic mean is as follows-
X̅w = ∑WX/∑W (individual series)
X̅w = ∑W(fX)/∑W (frequency distribution)
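A minimal sketch of the weighted arithmetic mean for an individual series follows. The scores and weights are hypothetical and only show how the weights shift the result away from the simple mean.

```python
# Hypothetical scores (X) and their weights (W).
scores = [70, 80, 90]
weights = [1, 2, 3]

# Weighted mean = sum(W * X) / sum(W)
weighted_mean = sum(w * x for x, w in zip(scores, weights)) / sum(weights)

# The simple mean treats every item as equally important.
simple_mean = sum(scores) / len(scores)

print(round(weighted_mean, 2), simple_mean)  # 83.33 80.0
```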
Geometric mean is defined as the Nth root of the product of N items or values.
GM = (X1 × X2 × … × XN)^(1/N)
Where X1, X2, X3 etc. represent various items of the series.
- In case of individual series, we use the following formula- GM = antilog((∑logX)/N)
- In case of ungrouped frequency distribution, we have items X1, X2, X3, …, Xn and corresponding frequencies f1, f2, f3, …, fn. The formula used to calculate Geometric mean is- GM = antilog((∑f logX)/N)
- In case of grouped frequency distribution, we use the following formula- GM = antilog((∑f log(m))/N)
Where m = midpoint of various classes given in question
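The logarithmic form of the geometric mean can be sketched as below. The helper name `geometric_mean` and its optional `freqs` argument are hypothetical; base-10 logarithms are assumed, so the antilog is `10 ** value`. For a grouped distribution, the class mid-points would be passed as `values`.

```python
import math

def geometric_mean(values, freqs=None):
    """GM = antilog(sum(f * log X) / N); all values must be positive."""
    if freqs is None:
        freqs = [1] * len(values)  # individual series: every frequency is 1
    N = sum(freqs)
    log_sum = sum(f * math.log10(x) for x, f in zip(values, freqs))
    return 10 ** (log_sum / N)

print(round(geometric_mean([2, 8]), 4))             # 4.0
print(round(geometric_mean([4, 9], [1, 1]), 4))     # 6.0
```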
Harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the individual observations.
HM = N /(1/X1 + 1/X2 + ⋯ + 1/Xn)
In individual series, HM = N / ∑(1/X)
In ungrouped frequency distribution, HM = N / ∑(f×1/X)
In grouped frequency distribution, HM = N / ∑(f×1/m)
Where m = mid-point of various classes given in question
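The harmonic mean formulas can be sketched in one small function. The name `harmonic_mean` and the `freqs` argument are hypothetical; for a grouped distribution the class mid-points would be passed as `values`.

```python
def harmonic_mean(values, freqs=None):
    """HM = N / sum(f * 1/X); all values must be non-zero."""
    if freqs is None:
        freqs = [1] * len(values)  # individual series: every frequency is 1
    N = sum(freqs)
    return N / sum(f / x for x, f in zip(values, freqs))

# Hypothetical individual series: 3 / (1/1 + 1/2 + 1/4) = 3 / 1.75
print(round(harmonic_mean([1, 2, 4]), 4))  # 1.7143
```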
Mode or the modal value is that value in a series of observations which has the highest frequency. For example, the mode of the individual series 3,5,7,5,6 would be 5 as this value occurs more frequently than any other value in the series.
There may be no unique modal value in a series. There may be two modes (a bimodal series) or three modes (a trimodal series). When there are more than three modes, the series is multimodal.
- In case of ungrouped frequency distribution, the item with maximum frequency represents the mode.
- When we have a grouped frequency distribution, we first need to identify the modal class in which the modal value lies. The modal class is the class which has the largest frequency. After we have identified the modal class, we use the following formula to calculate the modal value of the distribution-
Mode = L + (f1-f0)/(2f1-f0-f2) × h
Where L= Lower limit of the modal class
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
h = class interval of the modal class
- The above formula can be used only when the class intervals are uniform throughout the distribution.
- The above formula cannot be used when we are given a multimodal distribution. In this case we calculate empirical mode using the following relation- Mode = 3 Median-2 Mean
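A sketch of the grouped-data mode calculation is given below, assuming uniform class intervals and a single modal class. The function name `grouped_mode` and the example classes are hypothetical; treating a missing neighbouring class as having frequency 0 is also an assumption made for this sketch.

```python
def grouped_mode(lower_limits, freqs, h):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h for uniform class width h."""
    # Index of the modal class (the class with the largest frequency).
    i = max(range(len(freqs)), key=freqs.__getitem__)
    f1 = freqs[i]
    # Assumption: a missing preceding or succeeding class counts as frequency 0.
    f0 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * h

# Hypothetical classes 0-10, 10-20, 20-30, 30-40.
mode = grouped_mode([0, 10, 20, 30], [5, 8, 12, 7], h=10)
print(round(mode, 2))  # 24.44
```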
Median refers to the middlemost value in a distribution. It is known as a positional average because it divides the ordered observations into two equal parts. As a result, a change in the value of an extreme item will not cause any change in the median.
Cumulative frequency in simple words refers to the running total of all the frequencies.
Consider an Example-
X (income) | No. of person (f) |
---|---|
200 | 24 |
250 | 26 |
180 | 16 |
300 | 20 |
350 | 6 |
280 | 30 |
Solution-
In order to find median, we first rank the income in ascending order and then calculate cumulative frequency.
X (income) | No. of person (f) | Cumulative frequency (cf) |
---|---|---|
180 | 16 | 16 |
200 | 24 | 40 |
250 | 26 | 66 |
280 | 30 | 96 |
300 | 20 | 116 |
350 | 6 | 122 |
N=122
Median = size of (N+1)/2 th item
Median = size of (122+1)/2 th item
Therefore, Median = size of 61.5th item
From the cumulative frequency column, we find that
61.5th item = 250.
Hence, Median = 250
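The same steps (sort, accumulate frequencies, locate the (N+1)/2 th item) can be sketched in Python using the income data from the example above.

```python
from itertools import accumulate

# Income (X) and number of persons (f) from the example.
incomes = [200, 250, 180, 300, 350, 280]
persons = [24, 26, 16, 20, 6, 30]

# Step 1: arrange the incomes in ascending order, keeping each frequency attached.
pairs = sorted(zip(incomes, persons))

# Step 2: cumulative frequency is the running total of the frequencies.
cum_freq = list(accumulate(f for _, f in pairs))
N = cum_freq[-1]

# Step 3: the median is the value whose cumulative frequency first reaches (N+1)/2.
rank = (N + 1) / 2
median = next(x for (x, _), cf in zip(pairs, cum_freq) if cf >= rank)

print(N, rank, median)  # 122 61.5 250
```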
- In grouped frequency distribution, we are given different classes. So first we have to find that particular class in which the value of median lies.
The formula used to find median after we have the median class is-
Median = L + ((N/2 - cf)/f) × h
Where L= Lower limit of the median class
cf= Cumulative frequency of the class preceding the median class
f= Frequency of the median class
h= Class interval of the median class
In case of grouped frequency distribution, we use N/2 as the rank of the median instead of (N+1)/2.
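A sketch of the grouped-data median follows. The function name `grouped_median` and the example classes are hypothetical, and uniform class intervals are assumed.

```python
def grouped_median(lower_limits, freqs, h):
    """Median = L + ((N/2 - cf) / f) * h, where cf precedes the median class."""
    N = sum(freqs)
    if N == 0:
        raise ValueError("frequencies must not all be zero")
    cf = 0  # cumulative frequency of the classes before the current one
    for L, f in zip(lower_limits, freqs):
        # The median class is the first class whose cumulative frequency reaches N/2.
        if cf + f >= N / 2:
            return L + (N / 2 - cf) / f * h
        cf += f

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with N = 32.
print(grouped_median([0, 10, 20, 30], [5, 8, 12, 7], h=10))  # 22.5
```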
According to Spiegel “The degree to which the numerical data tend to spread about an average value is called the dispersion of the data”.
Dispersion measures the extent to which the items vary from some central value.
Range is the simplest method of studying dispersion. It is defined as the difference between the value of the smallest item and the value of the largest item included in the distribution.
Range = Largest value – smallest value
Relative measure of range is known as coefficient of range.
Coefficient of range = (L-S)/(L+S)
Where L= Largest value
S= smallest value
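As a quick illustration, the range and its coefficient can be computed for the marks used in the earlier frequency table (20 to 70).

```python
# Distinct marks from the univariate frequency distribution shown earlier.
marks = [20, 30, 40, 50, 60, 70]

L = max(marks)  # largest value
S = min(marks)  # smallest value

value_range = L - S
coefficient_of_range = (L - S) / (L + S)

print(value_range, round(coefficient_of_range, 4))  # 50 0.5556
```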
Quartile deviation or semi-interquartile range includes the middle 50% of the distribution only. Interquartile range refers to the difference between the third quartile (Q3) and first quartile (Q1).
Interquartile range = Q3 - Q1
Quartile deviation (QD) is represented as-
QD = (Q3 - Q1)/2
Quartile deviation gives the average amount by which the two quartiles differ from the median. The coefficient of quartile deviation, which is a relative measure, is calculated as follows-
Coefficient of QD = (Q3 - Q1)/(Q3 + Q1)
This relative measure is used to compare the degree of variation in different distributions.
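The three quartile-based measures can be sketched as below. The quartile values Q1 = 20 and Q3 = 50 are hypothetical inputs; how the quartiles themselves are located is not covered by this sketch.

```python
# Hypothetical quartiles (assumed to be already known).
Q1, Q3 = 20, 50

interquartile_range = Q3 - Q1
quartile_deviation = (Q3 - Q1) / 2
coefficient_of_qd = (Q3 - Q1) / (Q3 + Q1)

print(interquartile_range, quartile_deviation, round(coefficient_of_qd, 4))
# 30 15.0 0.4286
```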
Mean deviation is the average difference between the items (X) in a distribution from mean or median of that series of data.
MD or mean deviation = (∑∣D∣)/N (In individual series)
Where, ∣D∣ is the absolute value of the deviation from median or mean.
MD or mean deviation = (∑f∣D∣)/N (In ungrouped frequency distribution)
∣D∣ = ∣X - Median∣ or ∣X - Mean∣
MD or mean deviation = (∑f∣D∣)/N (In grouped frequency distribution)
∣D∣ = ∣m - Median∣ or ∣m - Mean∣
(m = midpoint of classes in question)
- Relative measure for mean deviation is given as, Coefficient of MD = MD/Mean or MD/Median
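A minimal sketch of mean deviation for an individual series is shown below, taken once about the mean and once about the median. The data values are hypothetical, and the helper name `mean_deviation` is an assumption made for illustration.

```python
import statistics

def mean_deviation(values, center):
    """MD = sum(|X - center|) / N, where center is the mean or the median."""
    return sum(abs(x - center) for x in values) / len(values)

# Hypothetical individual series.
data = [1, 2, 3, 4, 10]
mean = statistics.mean(data)      # 4
median = statistics.median(data)  # 3

md_mean = mean_deviation(data, mean)
md_median = mean_deviation(data, median)

# Relative measures: coefficient of MD = MD / mean or MD / median.
print(md_mean, round(md_mean / mean, 4))        # 2.4 0.6
print(md_median, round(md_median / median, 4))  # 2.2 0.7333
```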
Standard deviation is also known as root mean square deviation and measures the absolute dispersion or variation of a distribution. Higher standard deviation implies high degree of variability.
Following formulae are used to calculate standard deviation for different distributions-
When deviations are taken from Actual mean (X̅)
Individual series - SD = √((∑x²)/N) where x = (X-X̅)
Ungrouped frequency distribution - SD = √((∑fx²)/N) where x = (X-X̅)
Grouped frequency distribution - SD = √((∑fx²)/N) where x = (m-X̅) {m is the midpoint of classes in question}
When deviations are taken from Assumed mean (A)
Individual series - SD = √((∑d²)/N - ((∑d)/N)²) where d = (X-A)
Ungrouped frequency distribution - SD = √((∑fd²)/N - ((∑fd)/N)²) where d = (X-A)
Grouped frequency distribution - SD = √((∑fd²)/N - ((∑fd)/N)²) × h where d = (m-A)/h
and h = class interval
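As a check that the actual-mean and assumed-mean formulas agree, the sketch below applies both to the marks table given earlier (ungrouped frequency distribution, N = 60). The assumed mean A = 40 is an arbitrary choice made for illustration.

```python
import math

# Marks (X) and frequencies (f) from the univariate frequency distribution.
marks = [20, 30, 40, 50, 60, 70]
freq = [12, 8, 10, 20, 4, 6]
N = sum(freq)

# Deviations from the actual mean: SD = sqrt(sum(f * x^2) / N), x = X - mean
mean = sum(f * x for x, f in zip(marks, freq)) / N
sd_actual = math.sqrt(sum(f * (x - mean) ** 2 for x, f in zip(marks, freq)) / N)

# Deviations from an assumed mean: SD = sqrt(sum(f*d^2)/N - (sum(f*d)/N)^2), d = X - A
A = 40
sum_fd = sum(f * (x - A) for x, f in zip(marks, freq))
sum_fd2 = sum(f * (x - A) ** 2 for x, f in zip(marks, freq))
sd_assumed = math.sqrt(sum_fd2 / N - (sum_fd / N) ** 2)

print(round(sd_actual, 2), round(sd_assumed, 2))  # 15.42 15.42
```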
Standard deviation is more difficult to compute than other measures of dispersion.
Skewness refers to the asymmetry or lack of symmetry in the shape of a frequency distribution. In other words, when a distribution is not in symmetry, it is called a skewed distribution.
A distribution may be positively skewed or negatively skewed.
- When the distribution is positively skewed, then the frequencies (Y-axis) in the distribution are spread out over a greater range of values (on X-Axis) on the high-value end of the curve (right hand side).
- When the distribution is negatively skewed, then the frequencies (Y-axis) in the distribution are spread out over a greater range of values (X-axis) on the lower-value end of the curve (left hand side).
A positively skewed distribution is said to be skewed towards the right, whereas a negatively skewed distribution is skewed to the left. When the curve is normally spread, frequencies will be symmetric and equally distributed on both sides of the mid-point (or center point), and the mean, median and mode will all be equal. Such a distribution is called a normal or symmetrical distribution.
Statistical analysis is most often based on the concept of a bell-shaped symmetrically distributed normal distribution.
In a positively skewed distribution or a distribution that is skewed to the right, the value of mean (X̅) is maximum and value of mode (Mo) is the least. Median (Me) lies in the middle of the two. { X̅> Me > Mo}
In a negatively skewed distribution or a distribution that is skewed to the left, the value of mode (Mo) is maximum, the value of mean (X̅) is the least and Median (Me) lies in the middle of the two. { X̅< Me < Mo}
In a symmetric distribution, the value of mean is equal to the value of mode and is equal to the value of median.
Measures of skewness help in assessing the direction as well as the degree of asymmetry in a given data set. These measures can be absolute as well as relative.
Absolute skewness is the difference between the mean (X̅) and the mode (Mo). If the value of the mean is greater than the mode, then the skewness will be positive (+). Conversely, if the value of the mean is less than the value of the mode, then the skewness will be negative (-).
Absolute Sk = X̅ - Mo
Absolute skewness can also be expressed in terms of quartiles:
Absolute Sk = Q1+Q3-2Me
Relative skewness involves three measures:
1. Karl Pearson’s coefficient of skewness (SKp): This relative measure of skewness is based upon the absolute measure of skewness, which is the difference between mean and mode. When this difference is divided by the standard deviation, we get a relative measure. The formula is as follows:
SKp = (Mean-Mode)/(Standard deviation)
The above formula of relative skewness gives direction as well as the degree of skewness. However, the formula cannot be used when a distribution is bimodal. In this case we use the following relation between mean, median and mode-
Mode=3Median-2Mean
Using the above relation, we get the following formula for calculating relative skewness-
SKp=(3(Mean-Median))/(Standard deviation)
2. Bowley’s coefficient of skewness (SKb): This measure is based on the concept of quartiles. In a normal or symmetrical distribution, the 1st and the 3rd quartiles are equidistant from the 2nd quartile i.e. the median-
Q3+Q1-2Median=0
The formula for Bowley’s coefficient of skewness is given as-
SKb = (Q1+Q3-2Median)/(Q3-Q1)
Bowley’s measure of skewness is based on the middle 50% of the observations in a data set. The skewness analysis here leaves out the top 25% and the lowest 25% of the observations.
3. Kelly’s coefficient of Skewness (SKk): Skewness is concerned with the extreme values in a data. Clearly, Bowley’s measure of relative skewness does not include the extreme values. Kelly’s measure of relative skewness is based on 10th and 90th percentile (or 1st and 9th decile). The formula for Kelly’s coefficient of skewness can be shown as follows-
SKk = (P10+P90-2Median)/(P90-P10)
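The relative measures of skewness described above can be sketched as plain functions. The function names and all input values are hypothetical; the summary statistics (mean, median, mode, quartiles, percentiles) are assumed to have been computed already.

```python
def pearson_skewness(mean, mode, sd):
    """SKp = (Mean - Mode) / Standard deviation"""
    return (mean - mode) / sd

def pearson_skewness_median(mean, median, sd):
    """SKp = 3 * (Mean - Median) / Standard deviation"""
    return 3 * (mean - median) / sd

def bowley_skewness(q1, median, q3):
    """SKb = (Q1 + Q3 - 2 * Median) / (Q3 - Q1)"""
    return (q1 + q3 - 2 * median) / (q3 - q1)

def kelly_skewness(p10, median, p90):
    """SKk = (P10 + P90 - 2 * Median) / (P90 - P10)"""
    return (p10 + p90 - 2 * median) / (p90 - p10)

# Hypothetical summary statistics; the mode here satisfies Mode = 3 Median - 2 Mean.
print(pearson_skewness(mean=50, mode=44, sd=10))            # 0.6
print(pearson_skewness_median(mean=50, median=48, sd=10))   # 0.6
print(bowley_skewness(q1=30, median=45, q3=70))             # 0.25
print(round(kelly_skewness(p10=20, median=45, p90=90), 4))  # 0.2857
```

All four values are positive here, so each measure indicates a distribution skewed to the right.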
In statistical theory, Kurtosis is a measure of extent of flatness or tailedness in a frequency distribution curve.
Kurtosis of a distribution curve is usually studied relative to a normal (symmetric) distribution curve.
A curve, depending on the degree of flatness, can be mesokurtic (normal distribution curve), leptokurtic or platykurtic.
A distribution is said to be Mesokurtic when the kurtosis of that distribution is same as the kurtosis of the normal distribution. Kurtosis of a univariate normal distribution is 3.
A distribution is said to be Leptokurtic when it has positive excess kurtosis (kurtosis greater than 3). This type of distribution has a curve that is more peaked than the normal distribution curve.
A distribution is said to be Platykurtic when it has negative excess kurtosis (kurtosis less than 3). This type of distribution has a curve that is less peaked, or flatter, than the normal distribution curve.
L = Leptokurtic, M = Mesokurtic, P = Platykurtic
The above diagrammatical description of different regimes of Kurtosis (L, M, P) makes it clear that they differ widely with respect to convexity.
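The text does not give a formula for kurtosis. As a hedged sketch, the snippet below uses the standard moment coefficient (fourth central moment divided by the square of the second central moment), which equals 3 for a normal distribution, and classifies a hypothetical data set against that benchmark. The function names and the tolerance are assumptions made for illustration.

```python
def kurtosis(values):
    """Moment coefficient of kurtosis: m4 / m2**2 (equals 3 for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

def classify(k, tol=1e-9):
    """Compare the kurtosis with 3, the kurtosis of the normal distribution."""
    if abs(k - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"

# Hypothetical, evenly spread data set.
k = kurtosis([1, 2, 3, 4, 5])
print(round(k, 2), classify(k))  # 1.7 platykurtic
```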