Evaluating Household Data Assessment 1


Data Set: Household data

The data set includes information about 2000 households across 15 variables, all of which are considered in this study.

Tasks for Analysis of Data Set

Task 1

  1. Random sample of size 250.

We drew the sample using random number generation. Selecting households at random reduces selection bias, so the sample is more likely to be representative of the full data set.

A random number generator (RNG) is a device that generates a sequence of numbers or symbols that cannot reasonably be predicted better than by random chance.

We used uniform random number generation, so each random number lies between 0 and 1.
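As a minimal sketch of this sampling step, the Python snippet below attaches a uniform random number in [0, 1) to each of the 2000 households and keeps the 250 households with the smallest numbers. The household IDs 1 to 2000, the function name and the seed are assumptions made for illustration; the report does not say exactly how the random numbers were turned into a sample.

```python
import random


def draw_sample(household_ids, sample_size, seed=None):
    """Select sample_size households by sorting on uniform random keys."""
    rng = random.Random(seed)
    # Pair every household with a uniform random number in [0, 1).
    keyed = [(rng.random(), hid) for hid in household_ids]
    # Sorting by the random key puts the households in random order.
    keyed.sort()
    # Keep the households with the smallest random keys.
    return [hid for _, hid in keyed[:sample_size]]


# Hypothetical IDs 1..2000 stand in for the rows of the data set.
sample = draw_sample(range(1, 2001), 250, seed=1)
print(len(sample))
```

Because every household receives its own random key, no household can be selected twice, and fixing the seed makes the same sample reproducible.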

  1. Descriptive statistics and boxplot of Alcohol, Meals, Fuel and Phone.

Next we find the descriptive statistics of each of the four variables and draw a boxplot of each.

The descriptive statistics are in the Excel file (sheet 3).

Box plots of Alcohol, Meals, Fuel and Phone:

[Figures: box plots of Alcohol, Meals, Fuel and Phone (Image 1 and Image 2)]

  1. Interpretation of descriptive statistics and boxplot:

From the descriptive statistics we can say that the average annual expenditure (in AUD) on alcohol is higher than that on meals, fuel and phone.

Skewness assesses the extent to which a variable’s distribution is symmetrical.

Kurtosis measures whether the distribution is too peaked (a very narrow distribution with most of the responses in the center).

The skewness coefficient for all four variables is greater than 0, so each distribution is positively skewed.

For Alcohol, kurtosis = 2.86 < 3, so the distribution is platykurtic.

For Meals, Fuel and Phone the kurtosis coefficients are 8.31, 8.43 and 43.57 respectively, all greater than 3, so all three distributions are leptokurtic.
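To make the two coefficients concrete, the sketch below computes moment-based skewness and kurtosis in Python. The expenditure values are hypothetical, and the formulas are the plain population-moment versions, for which a normal distribution has kurtosis 3. Spreadsheet functions may use bias-adjusted variants or report excess kurtosis (kurtosis minus 3), so their output can differ from this sketch.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment scaled by the squared variance (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical annual expenditures with a long right tail.
spend = [120, 150, 180, 200, 220, 260, 300, 450, 900, 2500]
print(skewness(spend) > 0)   # positively skewed
print(kurtosis(spend) > 3)   # leptokurtic under the "compare with 3" rule
```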

Interpretation of boxplot :

In every boxplot some points lie outside the whiskers. These points appear to be outliers.

All four variables have outliers in the sample, which suggests that the population also contains outliers.
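One common way to flag the points drawn outside a boxplot is the 1.5 × IQR rule, sketched below in Python. The report does not state which rule its boxplots used, so the rule, the quartile method and the data here are all assumptions for illustration.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical expenditures; 1500 sits far above the rest.
spend = [100, 120, 130, 150, 160, 170, 180, 200, 210, 1500]
print(find_outliers(spend))  # [1500]
```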

Task 2

  1. Frequency distribution of expenditure of Utilities

The variable of interest here is Utilities.

We construct a frequency distribution of the expenditure on Utilities with 11 classes.

The classes are 0-300, 300-600, ..., 2700-3000 and More than 3000.

First arrange the Utilities data in ascending order.

Then find the frequency of each class, that is, the number of observations that fall in that class.

Class 0-300 contains observations from 0 up to and including 299.

Class 300-600 contains observations from 300 up to and including 599, and so on.

Repeating this for every class completes the frequency distribution.

The frequency distribution of Utilities expenditure is:

Classes           Frequency
0-300             16
300-600           33
600-900           51
900-1200          36
1200-1500         38
1500-1800         30
1800-2100         20
2100-2400         10
2400-2700         5
2700-3000         1
More than 3000    10
Totals            250
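The counting procedure described above can be sketched in Python as follows. The function name and the sample values are hypothetical. The sketch treats each class as including its lower limit and excluding its upper limit, and it places a value of exactly 3000 in the last class; both are assumptions about how boundary values should be handled.

```python
def frequency_distribution(values, width=300, upper=3000):
    """Count non-negative values into classes of equal width plus an open top class."""
    n_classes = upper // width
    labels = [f"{i * width}-{(i + 1) * width}" for i in range(n_classes)]
    labels.append(f"More than {upper}")
    counts = [0] * (n_classes + 1)
    for v in values:
        # Values at or above the upper limit fall into the last (open) class.
        index = min(int(v // width), n_classes)
        counts[index] += 1
    return list(zip(labels, counts))


# Hypothetical Utilities expenditures, not taken from the data set.
utilities = [150, 299, 300, 850, 1200, 2999, 3000, 4100]
for label, count in frequency_distribution(utilities):
    print(label, count)
```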

  1. Different percentages of households who spend on Utilities
    1. at the most $900 per annum

We need the percentage of households that spend at most $900 on Utilities.

P(Utilities ≤ $900)

= P(0-300 or 300-600 or 600-900)

= P(0-300 class) + P(300-600 class) + P(600-900 class)

= 16/250 + 33/250 + 51/250 = 100/250 = 0.4, that is, 40%

    1. between $1500 and $2700 per annum, and

We need the percentage of households that spend between $1500 and $2700 on Utilities.

P($1500 ≤ Utilities < $2700)

= P(1500-1800 or 1800-2100 or 2100-2400 or 2400-2700)

= P(1500-1800) + P(1800-2100) + P(2100-2400) + P(2400-2700)

= 30/250 + 20/250 + 10/250 + 5/250 = 65/250 = 0.26, that is, 26%

    1. more than $3000 per annum.

We need the percentage of households that spend more than $3000 on Utilities.

P(Utilities > $3000)

= P(more than 3000)

= 10/250 = 0.04, that is, 4%
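The three percentages can be checked directly from the class frequencies. The Python sketch below copies the frequencies from the frequency distribution of Utilities; the dictionary layout and the helper name are assumptions for illustration.

```python
# Class frequencies copied from the frequency distribution of Utilities (n = 250).
frequencies = {
    "0-300": 16, "300-600": 33, "600-900": 51, "900-1200": 36,
    "1200-1500": 38, "1500-1800": 30, "1800-2100": 20, "2100-2400": 10,
    "2400-2700": 5, "2700-3000": 1, "More than 3000": 10,
}
total = sum(frequencies.values())


def percentage(classes):
    """Percentage of households falling in the given classes."""
    return 100 * sum(frequencies[c] for c in classes) / total


at_most_900 = percentage(["0-300", "300-600", "600-900"])
between_1500_2700 = percentage(
    ["1500-1800", "1800-2100", "2100-2400", "2400-2700"]
)
more_than_3000 = percentage(["More than 3000"])
print(at_most_900, between_1500_2700, more_than_3000)  # 40.0 26.0 4.0
```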

Task 3

  1. Top 5% value and the bottom 5% value of the household’s annual after-tax income.

The variable of interest here is the household’s annual after-tax income (AtaxInc).

Let X be the random variable denoting a household’s annual after-tax income.

We first find the descriptive statistics for AtaxInc.

From the descriptive statistics, and assuming X is normally distributed:

X ~ N(µ = 60113.04, σ = 41293.33)

The top 5% can be written symbolically as

P(X > x) = 5% = 0.05

1 – P(X ≤ x) = 0.05

P(X ≤ x) = 1 – 0.05

P(X ≤ x) = 0.95

Using Excel,

z = 1.645 (1.644854 before rounding)

Now we can find x using the formula

x = µ + z*σ = 60113.04 + 1.644854*41293.33 = $128034.5

Thus a household’s after-tax income needs to be $128034.5 or higher to be in the top 5%.

So, under this model, 5% of households have an after-tax income higher than $128034.5.

The bottom 5% can be written symbolically as

P(X < x) = 5% = 0.05

z = -1.645 (-1.644854 before rounding)

x = µ + z*σ = 60113.04 - 1.644854*41293.33 = -$7808.45

Thus a household’s after-tax income needs to be -$7808.45 or lower to be in the bottom 5%.

So, under this model, 5% of households have an after-tax income lower than -$7808.45.
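The two cut-off values can be reproduced with Python's standard library, assuming the normal model with the mean and standard deviation quoted above. This is only a cross-check of the arithmetic and says nothing about whether the normal model fits the income data.

```python
from statistics import NormalDist

# Mean and standard deviation of AtaxInc as quoted in the text.
income = NormalDist(mu=60113.04, sigma=41293.33)

# inv_cdf(p) returns the value x with P(X <= x) = p.
top_5 = income.inv_cdf(0.95)      # cut-off for the top 5%
bottom_5 = income.inv_cdf(0.05)   # cut-off for the bottom 5%

print(round(top_5, 1), round(bottom_5, 1))
```

Because the normal distribution is symmetric, the two cut-offs lie the same distance on either side of the mean.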

  1. Type of variable Ownhouse and probability distribution of Ownhouse

The variable of interest here is Ownhouse.

It takes two values, 1 and 0.

1 : if a household owns a house

0 : if a household doesn’t own a house

(i) Is this a quantitative or a qualitative variable?

This is a qualitative variable because Ownhouse records yes/no type data.

(ii) What would be the probability distribution of this random variable if we choose randomly (a) Only 1 household? (b) 250 households? Provide any relevant condition(s) to justify your answer.

Let X be the number of households that own a house when a single household is chosen at random.

It takes two values, 1 and 0.

Now we find the probability of each outcome from the sample.

X        f        P
0        73       0.292
1        177      0.708
Total    250      1

The probability distribution of X is:

x        0        1        Total
p        0.292    0.708    1

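(a) If only 1 household is chosen, X follows a Bernoulli distribution with P(X = 1) = 0.708 and P(X = 0) = 0.292.

(b) If 250 households are chosen, the number of households that own a house follows a binomial distribution with n = 250 and p = 0.708, provided the households are selected independently and each has the same probability of owning a house.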
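As an illustrative sketch, the Python code below writes out the probability functions for one household (Bernoulli) and for 250 households (binomial), using the sample proportion 177/250 as the ownership probability. Independent selection and a constant ownership probability are modelling assumptions, and the function names are hypothetical.

```python
from math import comb

p_own = 177 / 250  # sample proportion of households that own a house


def bernoulli_pmf(x, p):
    """P(X = x) for a single household, where x is 0 or 1."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p if x == 1 else 1 - p


def binomial_pmf(k, n, p):
    """P(exactly k of n independent households own a house)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 250
total_probability = sum(binomial_pmf(k, n, p_own) for k in range(n + 1))
expected_owners = sum(k * binomial_pmf(k, n, p_own) for k in range(n + 1))
print(bernoulli_pmf(1, p_own), round(expected_owners, 6))
```

The expected number of owners under the binomial model is n*p = 250 * 0.708 = 177, which matches the observed count because p was estimated from the same sample.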

  1. Scatter plot of ln(Texp) vs ln(ATaxInc) and type of correlation

Dependent variable y = ln(Texp)

Independent variable x = ln(ATaxInc)

This is a simple linear regression problem.

Using Excel we get the following scatter plot.

[Figure: scatter plot of ln(Texp) against ln(ATaxInc)]

Correlation coefficient (r) = 0.7145

The correlation coefficient is positive, so there is a positive relationship between the two variables.

The scatter plot also shows a positive relationship between the natural logarithm of Texp and the natural logarithm of ATaxInc.
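For reference, the correlation coefficient between the two log-transformed variables can be computed as in the Python sketch below. The income and expenditure figures are hypothetical, so the result will not equal the 0.7145 reported for the actual sample.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical after-tax incomes and total expenditures (AUD).
ataxinc = [25000, 40000, 52000, 61000, 75000, 98000, 130000]
texp = [22000, 30000, 41000, 39000, 58000, 60000, 85000]

# Take natural logarithms before correlating, as in the scatter plot.
ln_x = [math.log(v) for v in ataxinc]
ln_y = [math.log(v) for v in texp]
r = pearson_r(ln_x, ln_y)
print(r > 0)  # a positive r indicates a positive linear relationship
```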

Task 4

  1. Contingency table of gender and level of education

The variables of interest here are gender and level of education.

Gender has two levels, male and female.

Level of education (highest degree) has five levels: primary, secondary, intermediate, bachelor's and master's.

Now we complete the contingency table for the data.

                  Highest degree
Gender      P      S      I      B      M      Total
Male        25     34     23     23     33     138
Female      24     25     20     26     17     112
Total       49     59     43     49     50     250

  1. Probability of male and level of education is intermediate

To find P(male and I):

P(male and I) = (number of household heads who are male with education level I) / (sample size) = 23/250

P(male and I) = 0.092

  1. Probability of female and level of education is Bachelor

To find P(female and B):

P(female and B) = (number of household heads who are female with education level B) / (sample size) = 26/250

P(female and B) = 0.1040

  1. Proportion of secondary level of education and male

To find P(S and male):

P(S and male) = (number of household heads who are male with education level S) / (sample size) = 34/250 = 0.1360
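These joint probabilities are each a single cell of the contingency table divided by the sample size. The Python sketch below copies the counts from the contingency table; the nested-dictionary layout and the function name are assumptions for illustration.

```python
# Counts copied from the contingency table (rows: gender, columns: highest degree).
table = {
    "Male":   {"P": 25, "S": 34, "I": 23, "B": 23, "M": 33},
    "Female": {"P": 24, "S": 25, "I": 20, "B": 26, "M": 17},
}
n = sum(sum(row.values()) for row in table.values())  # sample size, 250


def joint_probability(gender, degree):
    """P(gender and degree) = cell count / sample size."""
    return table[gender][degree] / n


print(joint_probability("Male", "I"))    # 0.092
print(joint_probability("Female", "B"))  # 0.104
print(joint_probability("Male", "S"))    # 0.136
```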

  1. Independence of female and level of education is Master degree.

Two events are independent if and only if

P(female and Master degree) = P(female) * P(Master degree)

P(female and Master degree) = 17/250 = 0.0680

P(female) * P(Master degree) = 112/250 * 50/250 = 56/625 = 0.0896

0.0680 ≠ 0.0896

Therefore the events "gender of household head is female" and "having the Master degree" are dependent events.
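The same comparison can be written as a short Python check that computes the joint probability and the product of the marginal probabilities from the table counts. The layout and variable names are assumptions; the check applies the product rule to the sample proportions exactly as the text does and does not allow for sampling variation.

```python
import math

# Counts copied from the contingency table (rows: gender, columns: highest degree).
table = {
    "Male":   {"P": 25, "S": 34, "I": 23, "B": 23, "M": 33},
    "Female": {"P": 24, "S": 25, "I": 20, "B": 26, "M": 17},
}
n = sum(sum(row.values()) for row in table.values())

# Marginal probabilities from the row total and the column total.
p_female = sum(table["Female"].values()) / n               # 112/250
p_master = sum(row["M"] for row in table.values()) / n     # 50/250

# Joint probability from the single Female/Master cell.
p_joint = table["Female"]["M"] / n                         # 17/250
p_product = p_female * p_master

independent = math.isclose(p_joint, p_product, rel_tol=1e-9)
print(round(p_joint, 4), round(p_product, 4), independent)
```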