| Table 1 of Huff-Corzine et al. (1986) |
To try out -corr2data-, we will replicate the analyses of a randomly chosen research paper, namely Huff-Corzine et al. (1986). Huff-Corzine and colleagues examine North–South differences in homicide in the US and, more importantly for our purposes, provide a full correlation table and conduct OLS regression analyses.
First, we need to read in the correlation matrix as reported in Table 1. (N.B. Stata 12 now comes with the -ssd- command, which might offer a more convenient way of reading in summary statistics.)
version 12
clear
capture which estout
if _rc ssc install estout
// Read in correlation table
#delimit ;
input str30 varnames double (a1-a9);
"1 Homicide rate"
1.00 0.91 0.83 0.88 0.43 -0.06 -0.30 0.65 0.83;
"2 Structural poverty index"
0.91 1.00 0.78 0.90 0.40 -0.04 -0.17 0.65 0.84;
"3 Southernness index"
0.83 0.78 1.00 0.72 0.34 0.14 -0.44 0.71 0.93;
"4 Perc nonwhite"
0.88 0.90 0.72 1.00 0.32 0.08 -0.14 0.68 0.76;
"5 Perc ages 20-34"
0.43 0.40 0.34 0.32 1.00 -0.14 -0.21 -0.09 0.32;
"6 Perc rural"
-0.06 -0.04 0.14 0.08 -0.14 1.00 0.14 0.28 0.10;
"7 Hospital beds/100K"
-0.30 -0.17 -0.44 -0.14 -0.21 0.14 1.00 -0.20 -0.30;
"8 Gini index"
0.65 0.65 0.71 0.68 -0.09 0.28 -0.20 1.00 0.73;
"9 Perc born in South"
0.83 0.84 0.93 0.76 0.32 0.10 -0.30 0.73 1.00;
end;
#delimit cr
// Convert to a matrix called M
mkmat a1-a9, matrix(M)
matrix list M
// Read in variable names
levelsof varnames, local(names)
local ednames = ""
// Create local "ednames" that includes all variable names,
// use function -strtoname()- to bring variable names into
// proper Stata format
foreach x of local names {
*di strtoname("`x'", 1)
local ed_name1 = strtoname("`x'", 1)
*di "`ed_name1'"
local ed_name2 = substr("`ed_name1'", 4, .)
*di "`ed_name2'"
local ednames = "`ednames'" + " " + "`ed_name2'"
}
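Correlations typed in from a printed table are rounded and easy to mistype, so a quick sanity check of the matrix can be worthwhile before handing it to -corr2data-. The following is only a minimal sketch: it uses a made-up 3x3 matrix called C (hypothetical values, not taken from the paper) so that it runs on its own. To check the real data, replace C with M.

```stata
// Toy correlation matrix (hypothetical values, NOT from the paper);
// replace C with M to check the matrix read in above
matrix C = (1, .5, .3 \ .5, 1, .2 \ .3, .2, 1)

// A correlation matrix must be symmetric with ones on the diagonal
assert issymmetric(C)
assert trace(C) == rowsof(C)

// Eigenvalues are returned in descending order; a negative smallest
// eigenvalue means the matrix is not positive semidefinite
matrix symeigen V lambda = C
matrix list lambda
local k = colsof(lambda)
display "smallest eigenvalue: " lambda[1, `k']
```

If the smallest eigenvalue is negative, the rounded correlations do not form a valid correlation matrix, and I would not expect -corr2data- to accept them.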
In this step, we create the artificial data set based on the correlation matrix, the means, and the standard deviations:
// Create data based on correlation matrix M, vectors of
// means and sd's, specify no. of cases and variable names
corr2data `ednames', n(48) clear ///
corr(M) ///
means(7.12 7.12 17.71 10.42 19.75 34.22 7.81 .38 7.18) ///
sds(4.25 3.88 9.22 8.85 1.27 14.37 1.57 .02 1.44)
One shouldn't underestimate the importance of fine-looking labels in a data set:
// Create nice-looking variable labels based on the variable names
local varlabels = ""
foreach var of varlist _all {
local varname = "`var'"
*di "`varname'"
local varname = subinstr("`varname'", "_" , " ", .)
*di "`varname'"
local varname = subinstr("`varname'", "Perc", "%", .)
*di "`varname'"
lab variable `var' "`varname'"
local varlabels = "`varlabels'" + " " + "`varname'"
}
A brief check shows that everything seems to be all right:
corr _all
tabstat _all, statistics(mean sd)
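Instead of comparing the output by eye, the check can also be automated. The sketch below again relies on a toy example: the matrices C, m and s and the variables y, x1 and x2 are hypothetical and only there to make it self-contained. It compares the correlation matrix of the generated data with the one that was fed in, using r(C) as left behind by -correlate-. The tolerance of 1e-5 is my own choice and allows for the variables being stored as floats.

```stata
clear all
// Toy inputs (hypothetical values, NOT from the paper)
matrix C = (1, .5, .3 \ .5, 1, .2 \ .3, .2, 1)
matrix m = (10, 20, 30)
matrix s = (2, 3, 4)

corr2data y x1 x2, n(50) corr(C) means(m) sds(s) clear

// Compare the correlations in the generated data with the input
quietly correlate y x1 x2
matrix R = r(C)
display "max. relative difference: " mreldif(R, C)
assert mreldif(R, C) < 1e-5

// Mean and s.d. of the first variable should match as well
quietly summarize y
assert reldif(r(mean), 10) < 1e-5
assert reldif(r(sd), 2) < 1e-5
```

For the data set created above, one would use M in place of C and the means and s.d.'s from the -corr2data- call.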
However, replicating the analyses yields different results:
eststo clear
// Run regression
eststo: regress Homicide_rate ///
Structural_poverty_index Southernness_index ///
Perc_nonwhite Perc_ages_20_34 Perc_rural ///
Hospital_beds_100K Gini_index, beta
eststo: regress Homicide_rate ///
Structural_poverty_index ///
Perc_nonwhite Perc_ages_20_34 Perc_rural ///
Hospital_beds_100K Gini_index Perc_born_in_South, beta
estadd beta: est1 est2
esttab, r2 nonumbers mlabels("South index" "% born South") ///
cells((b(fmt(3) star) beta(fmt(3)))) label not
Download the do-file here.
These results differ substantially from those reported by Huff-Corzine and colleagues (see below); however, I have a feeling that I should rather trust my own results.
| Table 2 of Huff-Corzine et al. (1986) |
A robustness check offered by -corr2data- is the -seed(#)- option, which allows for generating different artificial data sets. Specifying different seeds and comparing the results can serve as a check on whether the summary statistics are really sufficient for replicating an analysis. Using different seeds in our example always leads to the same result.
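Why the seed makes no difference can be illustrated with a small toy example. As far as I understand it, -corr2data- generates data that reproduce the requested correlations exactly, and OLS coefficients depend on nothing but the means, standard deviations and correlations. The sketch below uses a hypothetical matrix C and hypothetical variables y, x1 and x2 (not the paper's data). It runs the same regression on two data sets created with different seeds and then computes the standardized coefficients directly from the correlation matrix as inv(Rxx)*rxy. Since means and s.d.'s are left at their defaults of 0 and 1, the raw and standardized coefficients coincide.

```stata
clear all
// Toy correlation matrix (hypothetical values, NOT from the paper);
// row/column 1 is the outcome y, rows/columns 2-3 are x1 and x2
matrix C = (1, .5, .3 \ .5, 1, .2 \ .3, .2, 1)

// Same summary statistics, two different seeds
foreach sd in 1 2 {
    corr2data y x1 x2, n(50) corr(C) seed(`sd') clear
    display "seed `sd', first value of y: " y[1]
    quietly regress y x1 x2
    matrix b`sd' = e(b)
}
matrix list b1
matrix list b2
display "max. relative difference: " mreldif(b1, b2)

// Standardized coefficients straight from the correlations
matrix Rxx  = C[2..3, 2..3]
matrix rxy  = C[2..3, 1]
matrix beta = invsym(Rxx) * rxy
matrix list beta
```

The individual observations change with the seed, but the coefficient vectors should agree up to rounding, both with each other and with the matrix result. For an OLS regression, the seed check therefore cannot reveal anything new; it would only become informative for analyses that depend on more than the first two moments.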