ECON 4260 econometrics

Empirical Assignment 1: Racial Inequalities in Birth Weight in the US

This assignment asks you to prepare a small data set to explore descriptively the impact of the financial crisis on racial inequalities in birth weight in the US, as well as the impact of maternal smoking, using genuine birth certificate data from the Centers for Disease Control and Prevention (CDC). US Vital Statistics data are publicly available online from the CDC. The main purpose of this exercise is to introduce you to the exploratory work (data manipulation and basic descriptive analysis) that precedes any empirical (cross-sectional) analysis. The data set you produce will be used in a second (follow-up) assignment focusing on multivariate analysis.

All your work must be done in Stata (R and pandas will also be accepted). Please submit your coursework through Moodle. Your submission should preferably consist of a single (do-it-all) do-file. As shown in the tutorial, your do-file should allow me to replicate your results effortlessly and must produce a log file in which all results are reported.
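As a rough illustration of what a do-it-all do-file that writes its own log could look like, here is a minimal sketch. The log file name assignment1.log is a placeholder of my choosing, not a required name, and the skeleton is only one possible layout.

```stata
* Hypothetical skeleton of a single do-file; the log file name is a placeholder
version 13
clear all
set more off

* Close any log left open by an earlier run, then start a fresh text log
capture log close
log using "assignment1.log", replace text

* ... data preparation and analysis commands go here ...
display "Everything displayed between log using and log close is written to the log"

log close
```

Starting with capture log close means the file can be re-run from the top without an error if a previous run stopped before reaching log close.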

It is difficult to accurately identify the period during which the financial crisis could have potentially impacted individual health. The crisis is believed to have followed the bursting of the US housing bubble, which peaked at the end of 2006. The recession and associated financial crisis “officially” began in December, 2007, and “officially” ended in June, 2009. To make our lives a little easier, I have produced a 5% pooled sample of Vital Statistics data for the years 2005 (in utero pre-financial crisis), 2008 (in utero during the financial crisis) and 2014 (in utero post-financial crisis) in Stata 13 format with a few control variables.

Download the data directly from my Dropbox by using the following Stata command:

use "", clear

I have done the trickiest data manipulation myself. The codebook of derived variables is provided at the end of this document. A few variables, however, are still in their original format, including dbwt, precare, previs, previs_rec, cig_0-cig_3 (cig_0 is cigs before 2009), meduc, wtgain, gestrec10 and apgar. The codebooks describing these variables can be downloaded from my Dropbox by clicking here for 2005, 2008 and 2014, or by accessing the User’s Guide directly from the CDC website.

  1. Simple steps to data preparation
    • Our analysis explores the birth weight of Non-Hispanic Whites (NHW) and Non-Hispanic Blacks (NHB). Drop from the sample all mothers who do not belong to the “ethnic/racial” groups listed above. We assume for this work that an infant’s racial group is defined by that of its mother only (racem). Rename racem as race.
    • Use this variable to create 2 dummy variables (a dummy variable is a dichotomous variable taking the value 1 or 0), one for each group, and name them white for Non-Hispanic Whites and black for Non-Hispanic Blacks.
    • Generate a dummy variable, boy, which equals 1 if the infant is a boy and zero otherwise.
    • Parental education is a potential determinant of birth weight. For this assignment, we will simply consider the association with maternal education. The level of education of the mother, meduc, is also a categorical variable. For many mothers, the educational achievement was either Not on certificate or Unknown. Replace the assigned values for Not on certificate and Unknown by a missing value. I remind you that a missing value in Stata is represented by a dot “.”.
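The steps above can be sketched on a few made-up observations. The toy values are invented for illustration; the codes for racem and sex follow the label definitions at the end of this document, and the meduc codes treated as missing (-1 and 9) are an assumption that should be checked against the codebook for each year.

```stata
* Toy data standing in for the course sample (all values are made up)
clear
input racem sex meduc
1 0  3
2 1 -1
3 0  6
1 1  9
2 0  2
end

* Keep Non-Hispanic Whites (1) and Non-Hispanic Blacks (2) only
keep if inlist(racem, 1, 2)
rename racem race

* One dummy per group; race has no missing values after the keep
gen white = (race == 1)
gen black = (race == 2)

* sex is coded 0 = Male, 1 = Female in the derived data
gen boy = (sex == 0)

* Assumed codes: -1 = Not on certificate, 9 = Unknown (check the codebook)
replace meduc = . if inlist(meduc, -1, 9)

list
```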

Many Stata commands do not include observations with missing values in the analysis. You should know, however, that Stata also treats a missing value as a very large number, larger than any non-missing value. Be careful! This implies that the following statement:

gen smoker = (cig_1>0)

would misleadingly code all observations with missing values for cig_1 as smokers. A more accurate statement would be:

gen smoker = (cig_1>0 & cig_1!=.)
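The difference between the two statements can be seen on three made-up observations. This is only a sketch: the variable names smoker_bad, smoker_good and smoker_miss are mine, and the third variant, which leaves the dummy missing when cig_1 is missing rather than coding it 0, is an alternative convention rather than a requirement.

```stata
* Three toy observations: a non-smoker, a smoker and a missing value
clear
input cig_1
0
10
.
end

* Missing counts as larger than any number, so the last row becomes 1
gen smoker_bad = (cig_1 > 0)

* Excluding missing explicitly codes the last row as 0
gen smoker_good = (cig_1 > 0 & cig_1 != .)

* Alternative: keep the dummy missing whenever cig_1 is missing
gen smoker_miss = (cig_1 > 0) if !missing(cig_1)

list
```

Which of the last two conventions is appropriate depends on whether mothers with unknown smoking status should be counted as non-smokers or left out of the comparison.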

  • Generate a new categorical variable for education, educ, equal to 1 if the mother did not complete high school, 2 if the mother has either completed high school or has some college credits, and 3 if she completed a degree (AA, AS, BA, MA, PhD). Make sure not to assign a value 0 to mothers whose educational attainment is missing (remembering that Stata treats missing values as a large number).
  • Generate three dummy (dichotomous) variables, one for each educational level, where primary=1 if the educational attainment of the mother is less than high school (and zero otherwise), secondary=1 if educ == 2 (and zero otherwise), and tertiary=1 if educ == 3 (and zero otherwise).
  • You will confirm from the data dictionary that the number of cigarettes smoked daily before (cig_0) and during pregnancy (cig_1-cig_3) is sometimes either Unknown or not stated, or Not on certificate. These numeric codes must again be replaced by a missing value. Use cig_1-cig_3 to generate a dichotomous variable for each pregnancy trimester and name it smoke_i (where i ∈ {1, 2, 3}), taking the value 1 if the mother smoked in trimester i and zero otherwise.
  • Generate a dummy variable, smoker, equal to 1 if the mother reported smoking in any trimester of her pregnancy and zero otherwise.
  • Generate a categorical variable, crisis, taking the value 1 if childbearing occurred before the financial crisis, 2 if it occurred during the financial crisis and 3 if it occurred after the financial crisis. Label it with label var. Define and assign value labels to each level, 1 “Before”, 2 “During” and 3 “After”, using the commands label define and label values.
  • Recode the numerical value 99 of previs to a missing value, since 99 is the code for “unknown or not stated”. See the codebooks for more details. Rename previs as prenatal.
  • Finally, the reported birth weight (dbwt) of a few infants is “Not stated”. You will notice from the codebook that an infant’s birth weight (dbwt) is equal to 9999 when birth weight is not stated. Again, make sure to assign a missing value when birth weight is not stated.
  • Save a final analytical sample which only includes the following variables: dbwt, mager, race, white, black, married, cig_0, boy, educ, smoke_1, smoke_2, smoke_3, smoker, crisis, primary, secondary, tertiary, prenatal, and name it us_dbwt_2019.dta.
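The remaining preparation steps can be sketched on made-up data as follows. Everything here is illustrative: the toy values are invented, the grouping of meduc codes into three levels follows my reading of the label definition at the end of this document, and using recode and egen rowmax() is just one of several ways to do it. Dummies are left missing when the underlying variable is missing, which is a choice rather than a requirement.

```stata
* Toy data (made up); meduc is assumed to be already cleaned of non-response codes
clear
input year meduc cig_0 cig_1 cig_2 cig_3 previs dbwt
2005 2  0  0  0  0 12 3400
2008 4 20 10  5  0 99 9999
2014 6 99 99 99 99  8 2900
2014 .  5 -1 -1 -1 10 3100
end

* educ: 1 = less than high school, 2 = high school or some college, 3 = degree
* Missing meduc matches no rule and therefore stays missing
recode meduc (1/2 = 1) (3/4 = 2) (5/8 = 3), gen(educ)
gen primary   = (educ == 1) if !missing(educ)
gen secondary = (educ == 2) if !missing(educ)
gen tertiary  = (educ == 3) if !missing(educ)

* -1 and 99 are the non-response codes for the cigarette variables
forvalues i = 0/3 {
    replace cig_`i' = . if inlist(cig_`i', -1, 99)
}
forvalues i = 1/3 {
    gen smoke_`i' = (cig_`i' > 0) if !missing(cig_`i')
}

* 1 if any trimester dummy is 1; missing only if all three are missing
egen smoker = rowmax(smoke_1 smoke_2 smoke_3)

* Period indicator built from the year of birth
recode year (2005 = 1) (2008 = 2) (2014 = 3), gen(crisis)
label var crisis "Period relative to the financial crisis"
label define crisis 1 "Before" 2 "During" 3 "After"
label values crisis crisis

replace previs = . if previs == 99
rename previs prenatal
replace dbwt = . if dbwt == 9999

* Toy subset of the keep list; the real list also includes mager, race, etc.
keep dbwt cig_0 educ primary secondary tertiary smoke_1 smoke_2 smoke_3 ///
    smoker crisis prenatal
* save "us_dbwt_2019.dta", replace   // uncomment in the real do-file
list
```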
  2. Questions: Descriptive statistics

Answers to these questions must be reported in a log file produced by your do-file. All Tables, regression results and comments must appear in the log file.

Many questions below ask you to report your results in a table. Professional-looking tables can easily be produced in Stata using the command esttab in combination with the very versatile user-written command estout by Ben Jann (University of Bern). Visit these three websites to become more familiar with these commands: Creating Publication-Quality Tables in Stata; Data Analysis with Stata; and the original pages from Ben Jann, estpost – Posting results from non-eclass commands and estout.
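To give a feel for how estpost and esttab fit together, here is a minimal sketch on the auto data set that ships with Stata. It is not the course template, which remains the reference for the assignment, and the formatting options are my own choices. The estout package is user-written and must be installed once with ssc install estout.

```stata
* Requires the user-written estout package:  ssc install estout
sysuse auto, clear

* Post summary statistics so that esttab can tabulate them
estpost summarize price mpg weight

* One row per variable; mean, standard deviation and count side by side
esttab, cells("mean(fmt(2)) sd(fmt(2)) count(fmt(0))") noobs nonumber nomtitle
```

Placing the three statistics inside one pair of double quotes in cells() puts them in adjacent columns rather than stacking them.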

  • Produce a table of descriptive statistics from your sample reporting the mean, standard deviation and number of observations for the variables dbwt mager white black married primary secondary tertiary smoker smoke_1 smoke_2 smoke_3 prenatal. To produce this table, use the exact syntax provided in the template.
  • Produce a similar table reporting only the mean and associated standard error (not the standard deviation) of the variables dbwt mager married primary secondary tertiary by racial group. For this question, do not use estpost summarize as it does not produce a standard error.
  • Replicate the calculation of the mean birth weight for each racial group using the simple linear regression model. This would imply that you run a regression separately for each group.
  • Produce a table reporting the mean and associated standard error of birth weight before, during and after the financial crisis. Do you find suggestive evidence supporting the existence of an association between birth weight and the financial crisis?
  • Investigate the association between birth weight and maternal smoking during pregnancy by contrasting the average birth weight of infants of mothers who do not smoke with: i) infants of mothers who smoked before pregnancy but not during pregnancy and, ii) infants of mothers who smoked before and during pregnancy. What is the association between birth weight and maternal smoking? Briefly explain. Note that for this question, I am not asking for a formal statistical test. This could however be done and would definitely be better!
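The mechanics behind several of these questions can be sketched on made-up data. The values below are invented, the variable smokecat and its labels are hypothetical names of mine, and the commands shown (mean with over() and an intercept-only regress) are one possible route rather than the required one.

```stata
* Toy data (made up): birth weight, race dummy, period, smoking information
clear
input dbwt white crisis cig_0 smoker
3400 1 1  0 0
3200 1 2 10 1
3500 1 3  5 0
3000 0 1 20 1
3100 0 2  0 0
2900 0 3  8 0
end

* Group means with standard errors
mean dbwt, over(white)

* The constant of an intercept-only regression equals the group mean
summarize dbwt if white == 1
local m_white = r(mean)
regress dbwt if white == 1
display "constant minus sample mean: " _b[_cons] - `m_white'
regress dbwt if white == 0

* Means with standard errors before, during and after the crisis
mean dbwt, over(crisis)

* Hypothetical three-way smoking classification for the last question
gen smokecat = 0 if cig_0 == 0 & smoker == 0
replace smokecat = 1 if cig_0 > 0 & !missing(cig_0) & smoker == 0
replace smokecat = 2 if cig_0 > 0 & !missing(cig_0) & smoker == 1
label define smokecat 0 "Non-smoker" 1 "Before only" 2 "Before and during"
label values smokecat smokecat
mean dbwt, over(smokecat)
```

Regressing on a dummy such as white over the full sample, instead of running one regression per group, would also deliver a direct estimate of the difference in means together with its standard error.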


Variables which have been slightly changed from the original data.

racem and racef

A lot of work went into producing this variable. I am simply giving the label definition.

label define race 1 "Non-Hispanic White" 2 "Non-Hispanic Black" ///
    3 "Mexican" 4 "Other Hispanic" 5 "Hispanic Black" 6 "Asian" 7 "American Indian / Alaskan" ///
    8 "Mixed / Unidentified"
label val racem race

meduc (original from the user’s guide)

label var meduc "Mothers Education"
label define meduc -1 "Not on certificate" 1 "8th grade or less" ///
    2 "9th through 12th grade with no diploma" 3 "High school graduate or GED completed" ///
    4 "Some college credit, but not a degree" 5 "Associate degree (AA, AS)" ///
    6 "Bachelor’s degree (BA, AB, BS)" 7 "Master’s degree (MA, MS)" ///
    8 "Doctorate (PHD, EdD) or Professional Degree (MD, DDS, DVM, LLB, JD)" ///
    9 "Unknown" 99 "Not stated"
label values meduc meduc

married

if (`i'>13) {
    ren dmar mar
}

lab def mar 1 "Yes" 2 "No" 9 "Unknown or not Stated"
lab val mar mar
gen married=(mar==1)
label var married "Married"
label define married 1 "Married" 0 "Unmarried"
label values married married

sex

gen isex=(sex=="F")
drop sex
ren isex sex
label var sex "Infant’s sex"
label define sex 0 "Male" 1 "Female"
label values sex sex

year and month

ren dob_mm month
label var month "Month of Birth"
ren dob_yy year
label var year "Year of Birth"

hospital

gen hospital=(ubfacil==1)
label var hospital "Born in hospital"
label define hosp 0 "no" 1 "yes"
label values hospital hosp


if (`i'<9) {
    rename cigs cig_0
}

forv v=0/3 {
    replace cig_`v'=-1 if cig_`v'==.
    lab def cig_`v' 99 "Unknown or not stated" -1 "Not on certificate"
    lab val cig_`v' cig_`v'
}