Using STATA to Compute the Mean, Standard Deviation, Minimum and Maximum Values of Variables

EXAMPLE: Compute the means and the standard deviations of LNWAGE, EDU and EX for the entire sample and then by gender (male/female), by race (white/non-white/Hispanic) and by union status (union/non-union). Within each of the three groupings (gender, race and union status), find which subgroup has the highest average LNWAGE and the highest dispersion as measured by the standard deviation. Do the same for EDU.

To get the summary statistics, including means and the standard deviations of LNWAGE, EDU and EX for the given sample, we summarize the variables and the output obtained is as follows:


Stata command:


{`
  * General syntax: summarize varname
  * To summarize the variable LNWAGE:
  summarize LNWAGE
  `}
[Stata output: summarize LNWAGE]

The first column gives the number of observations in the sample. The second column gives the mean of the variable, here the natural logarithm of the individual's hourly wage in dollars (LNWAGE). Thus, on average, the natural logarithm of an individual's hourly wage is 2.059181. The third column reports the standard deviation of LNWAGE, which measures the dispersion of the log hourly wage around its mean. The standard deviation is low in this case, which implies that the values of LNWAGE lie close to the mean.
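The figures in these columns can also be read back programmatically. The following is a minimal sketch, assuming the dataset with LNWAGE is already loaded; it relies on the results that `summarize` stores in `r()` after it runs.

```stata
* Sketch: retrieve the statistics that summarize leaves behind in r()
quietly summarize LNWAGE
display "Observations       = " r(N)
display "Mean               = " r(mean)
display "Standard deviation = " r(sd)
display "Minimum            = " r(min)
display "Maximum            = " r(max)
```

These stored results are overwritten by the next command that writes to `r()`, so they should be used or saved immediately after `summarize`.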

The next column gives the minimum value of the natural logarithm of the hourly wage, which is 0 in this sample. This means that the minimum hourly wage is $1 (because ln(1) = 0).

The last column gives the maximum value of LNWAGE, which is 3.7955 in the given sample. Since LNWAGE is a natural logarithm, this implies that the maximum hourly wage is about $44.50 (because exp(3.7955) ≈ 44.50).
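Taking LNWAGE to be a natural logarithm, as the text describes it, the exponential function converts the minimum and maximum back into dollar wages. A minimal sketch using the two values quoted above:

```stata
* Back-transform log wages to dollars, assuming LNWAGE = ln(hourly wage)
* Minimum: exp(0) equals 1
display exp(0)
* Maximum: exp(3.7955) is about 44.50
display %9.2f exp(3.7955)
```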

Similarly we can summarize other variables in the exercise as well:

{`
  summarize EDU
  `}
[Stata output: summarize EDU]

The first column gives the number of observations in the sample. The second column gives the average years of schooling (EDU) in the given sample. Thus, on average, an individual attends school for about 13 years. The third column reports the standard deviation of EDU, which measures the dispersion of years of schooling around the mean.

The next column gives the minimum years of schooling, which is 2 years in the given sample.

The last column gives the maximum value of EDU, which is 18 years in the given sample. Thus, the most educated individuals in the sample have 18 years of schooling.

{`
  summarize EX
  `}
[Stata output: summarize EX]

The first column gives the number of observations in the sample. The second column gives the average potential years of experience (EX) in the given sample. Note that potential experience is computed as the individual's age minus years of schooling minus 6. Thus, on average, an individual has about 17 years of potential experience.
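The definition of potential experience can be checked directly if the dataset also carries an age variable. The sketch below assumes such a variable exists and is named `AGE`; that name and the new variable `EX_check` are illustrative and do not come from the text.

```stata
* Assumption: the dataset has an age variable called AGE (hypothetical name)
* Potential experience = age - years of schooling - 6
generate EX_check = AGE - EDU - 6

* Compare the reconstructed variable with the EX supplied in the data
summarize EX EX_check
```

If the two rows of output agree, EX in the data follows the definition above.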

The third column reports the standard deviation of EX, which measures the dispersion of potential experience around the mean. In this sample the standard deviation is large, so individuals' potential experience differs considerably from the average value.

The next column gives the minimum potential experience, which is 0 years in the given sample. This implies that the sample includes some new entrants to the labor force.

The last column gives the maximum potential experience (EX), which is 55 years in the given sample. Thus, the most experienced individual in the sample has 55 years of potential experience.
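The three separate `summarize` calls can also be combined. The following sketch shows two ways to get the same five statistics for all three variables at once; `tabstat` is used here only as an alternative layout, not as the method the tutorial prescribes.

```stata
* One table with a row per variable
summarize LNWAGE EDU EX

* The same statistics with tabstat, one column per statistic
tabstat LNWAGE EDU EX, statistics(n mean sd min max) columns(statistics)
```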

COMPUTING THE MEAN AND STANDARD DEVIATION OF ONE VARIABLE, SORTED BY ANOTHER VARIABLE

Computation of means and standard deviation of LNWAGE by Gender (male/female)

We now compute the means and standard deviations sorted by gender, race and union status. The commands used are listed with each output. FE is the dummy variable for gender; it takes the value 1 for female and 0 for male. The first table (FE = 0) therefore gives the summary statistics of LNWAGE for males and the second table (FE = 1) gives the summary statistics for females.

We give the command for this in STATA in two steps: the first command sorts the data by the dummy variable, and the second summarizes the variable of our choice separately for each value of that sorted variable.

{`
  * Stata commands are case-sensitive and must be typed in lowercase
  sort FE
  by FE: summarize LNWAGE
  `}
FE = 0

[Stata output: summary statistics of LNWAGE for FE = 0]

FE = 1

[Stata output: summary statistics of LNWAGE for FE = 1]
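The two-step sort-then-summarize sequence can also be written as a single command. This is a minimal sketch of the equivalent one-line form, using the `bysort` prefix, which sorts the data by FE before running the command for each group.

```stata
* Equivalent to "sort FE" followed by "by FE: summarize LNWAGE"
bysort FE: summarize LNWAGE
```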

Interpretation of STATA output

Based on the output obtained, we can infer that the average LNWAGE is higher for males than for females; thus, on average, males earn a higher hourly wage. The dispersion in LNWAGE is also slightly greater for males, that is, LNWAGE varies somewhat more around its mean for males than for females. It is also interesting to note that, in the given sample, the minimum hourly wage for females is higher than that for males, and the same is true of the maximum. Thus, the highest wage in our sample is earned by a female.
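Comparisons like these are easier to make when both groups appear in one table. The sketch below is one way to do that with `tabstat`; the `nototal` option only suppresses the row for the full sample.

```stata
* Side-by-side group statistics for LNWAGE: one row per value of FE
tabstat LNWAGE, by(FE) statistics(n mean sd min max) nototal
```

The subgroup with the highest mean and the one with the highest standard deviation can then be read off adjacent rows.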

Computation of means and standard deviation of LNWAGE by Race (White/Non-White/Hispanic)

After gender, we sort the summary statistics by race. The dummy variable NONWH (non-white and non-Hispanic) takes the value 1 for individuals who are non-white and non-Hispanic and 0 otherwise. The summary statistics sorted by race are given below along with the commands.

{`
  sort NONWH
  by NONWH: summarize LNWAGE
  `}
NONWH = 0

[Stata output: summary statistics of LNWAGE for NONWH = 0]

NONWH = 1

[Stata output: summary statistics of LNWAGE for NONWH = 1]

Interpretation of STATA output

Based on the above output, we note that the two subsamples based on race differ greatly in their number of observations. There are just 67 people who are non-white and non-Hispanic, while the other subsample is almost 7 times as large. We observe that the mean LNWAGE of those who are white or Hispanic is greater than that of those who are neither white nor Hispanic. The minimum hourly wage of the latter group is considerably higher than that of the former. The highest-paid individuals in the sample, however, are white or Hispanic.
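The relative size of the two race subsamples can be computed rather than read off the tables. A minimal sketch, where the scalar names `n_nonwh` and `n_other` are illustrative:

```stata
* Count the observations in each race group
count if NONWH == 1
scalar n_nonwh = r(N)
count if NONWH == 0
scalar n_other = r(N)

* Ratio of the larger subsample to the smaller one
display "Size ratio (NONWH = 0 over NONWH = 1): " n_other / n_nonwh
```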

Computation of means and standard deviation of LNWAGE by Union Status (Union/Non-Union)

Lastly, we sort the summary statistics of LNWAGE by union status. The dummy variable UNIO (individual works in a union job) takes the value 1 when a person works in a union job and 0 otherwise. Below is the output obtained in STATA, along with the commands.

{`
  sort UNIO
  by UNIO: summarize LNWAGE
  `}
UNIO = 0

[Stata output: summary statistics of LNWAGE for UNIO = 0]

UNIO = 1

[Stata output: summary statistics of LNWAGE for UNIO = 1]

Interpretation of STATA output

The above tables can be interpreted as follows. First, relatively few people work in union jobs; the majority of the sample works in non-union jobs. However, the average LNWAGE is higher in union jobs, with less variability around the mean. As expected, the minimum hourly wage of union workers is higher than that of non-union workers. However, it is a non-union worker who earns the highest hourly wage in the sample.
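To confirm which group the top earner belongs to, the observation with the maximum LNWAGE can be listed together with its union status. This is a sketch of one way to do it, using the maximum stored by `summarize`.

```stata
* Find the largest value of LNWAGE in the full sample
quietly summarize LNWAGE

* Show the union status of the observation(s) that attain it
list UNIO LNWAGE if LNWAGE == r(max)
```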

Computation of means and standard deviation of EDU by Gender (male/female)

We then sort the summary statistics of years of schooling (EDU) by gender, race and union status. The first sorting is by gender, using the dummy variable FE, which takes the value 1 for female and 0 for male. The STATA commands and output are as follows:

{`
  sort FE
  by FE: summarize EDU
  `}
FE = 0

[Stata output: summary statistics of EDU for FE = 0]

FE = 1

[Stata output: summary statistics of EDU for FE = 1]

Interpretation of STATA output

Interpreting the output along the same lines as above, we find that the average years of schooling are more or less the same for females and males; thus, there is apparently no gender bias in education. The dispersion in years of schooling, as measured by the standard deviation, is greater among males than among females. Females have a higher minimum level of education than males, and the maximum years of schooling in the sample is attained by both males and females.
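When the goal is to rank subgroups, the group statistics can be turned into a small dataset and sorted. The sketch below uses `collapse` inside `preserve`/`restore` so that the original data are left untouched; the new variable names (`mean_edu`, `sd_edu`, and so on) are illustrative.

```stata
* Keep a copy of the data, then reduce it to one row per gender
preserve
collapse (mean) mean_edu = EDU (sd) sd_edu = EDU ///
         (min) min_edu = EDU (max) max_edu = EDU, by(FE)

* Largest mean first, then show the table of group statistics
gsort -mean_edu
list

* Bring back the original observations
restore
```

The `///` line continuation works in a do-file; when typing interactively, the `collapse` command should be entered on a single line.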

Computation of means and standard deviation of EDU by Race (White/Non-White/Hispanic)

The summary statistics of years of schooling are next sorted by race. Race is represented by the categorical variable NONWH, which is coded 1 for those who are non-white and non-Hispanic and 0 otherwise.

The STATA output based on the given sample is as follows:

{`
  sort NONWH
  by NONWH: summarize EDU
  `}
NONWH = 0

[Stata output: summary statistics of EDU for NONWH = 0]

NONWH = 1

[Stata output: summary statistics of EDU for NONWH = 1]

Interpretation of STATA output

Whites or Hispanics have the higher average years of schooling, though the difference is marginal (0.43 years). The dispersion in years of schooling is almost the same for both groups (it is slightly greater for whites or Hispanics, by 0.01). Non-whites and non-Hispanics have a higher minimum years of schooling, and both groups include people who have attained the maximum years of schooling in the sample.
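The differences in means and standard deviations quoted above can be computed from stored results instead of by hand. A minimal sketch, with scalar names chosen only for illustration:

```stata
* Statistics for whites or Hispanics (NONWH = 0)
quietly summarize EDU if NONWH == 0
scalar mean0 = r(mean)
scalar sd0   = r(sd)

* Statistics for non-whites and non-Hispanics (NONWH = 1)
quietly summarize EDU if NONWH == 1
scalar mean1 = r(mean)
scalar sd1   = r(sd)

* Gaps between the two groups
display "Difference in mean EDU: " mean0 - mean1
display "Difference in SD of EDU: " sd0 - sd1
```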

Computation of means and standard deviation of EDU by Union Status (Union/Non-Union)

Lastly, the summary statistics of years of schooling are sorted by union status. Union status is represented by the categorical variable UNIO, which is coded 1 for those in union jobs and 0 for those in non-union jobs.

The STATA output and the commands are as follows:

{`
  sort UNIO
  by UNIO: summarize EDU
  `}
UNIO = 0

[Stata output: summary statistics of EDU for UNIO = 0]

UNIO = 1

[Stata output: summary statistics of EDU for UNIO = 1]

Interpretation of STATA output

The average years of schooling are almost the same for workers in union and non-union jobs, with a marginal difference of about 0.16 years (in favor of non-union jobs). The same is true of the dispersion as measured by the standard deviation, with a difference of about 0.02 (in favor of union jobs). However, workers in union jobs have a higher minimum years of schooling than workers in non-union jobs.
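All six group comparisons in this exercise follow the same pattern, so they can be produced in one pass. The following is a sketch of a loop over the three grouping variables and the two outcome variables; it is a convenience, not a replacement for the step-by-step commands above.

```stata
* For each grouping variable, tabulate group statistics of each outcome
foreach g of varlist FE NONWH UNIO {
    foreach v of varlist LNWAGE EDU {
        display _newline "Summary of `v' by `g'"
        tabstat `v', by(`g') statistics(n mean sd min max) nototal
    }
}
```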
