Aug 20, 2012

Replicating simple analyses with -corr2data-

A correlation table is a staple of academic papers in many disciplines, such as psychology. One reason is that many basic statistical methods depend only on the correlation (or covariance) matrix. Thus, the means and standard deviations, the correlation matrix, and the sample size are sufficient to replicate many standard analyses, such as OLS regression or principal component analysis. -corr2data- makes this possible by creating an artificial data set from the correlation matrix (or covariance matrix), means, standard deviations, and sample size. This data set can then be used to replicate or modify the analyses.
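To see why these summary statistics are enough for OLS, here is a short sketch in standard textbook notation. The symbols are introduced here for illustration and do not come from the paper: R_xx is the correlation matrix among the predictors, r_xy the vector of correlations between the predictors and the outcome, s_y and s_j the standard deviations, and the barred quantities the means.

```latex
\begin{align}
\boldsymbol{\beta}^{*} &= \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xy}
  && \text{(standardized coefficients)} \\
b_j &= \beta^{*}_j\,\frac{s_y}{s_j}
  && \text{(unstandardized slopes)} \\
b_0 &= \bar{y} - \sum_j b_j\,\bar{x}_j
  && \text{(intercept)} \\
R^2 &= \mathbf{r}_{xy}^{\top}\,\boldsymbol{\beta}^{*}
  && \text{(explained variance)}
\end{align}
```

The sample size enters only through the standard errors and test statistics, which is why it has to be reported alongside the table.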

Table 1 of Huff-Corzine et al. (1986)

To try out -corr2data-, we will replicate a randomly chosen research paper, namely Huff-Corzine et al. (1986). Huff-Corzine and colleagues examine North–South differences in homicide in the US and, more importantly for our purposes, provide a full correlation table and conduct OLS regression analyses.


First, we need to read in the correlation matrix as reported in Table 1. (N.B. Stata 12 now comes with the -ssd- command, which may offer a more convenient way of reading in summary statistics.)
version 12
clear
capture which estout
if _rc ssc install estout

// Read in correlation table
#delimit ;
input str30 varnames double (a1-a9);
"1 Homicide rate"
    1.00  0.91  0.83  0.88  0.43 -0.06 -0.30  0.65  0.83;
"2 Structural poverty index"
    0.91  1.00  0.78  0.90  0.40 -0.04 -0.17  0.65  0.84;
"3 Southernness index"
    0.83  0.78  1.00  0.72  0.34  0.14 -0.44  0.71  0.93;
"4 Perc nonwhite"
    0.88  0.90  0.72  1.00  0.32  0.08 -0.14  0.68  0.76;
"5 Perc ages 20-34"
    0.43  0.40  0.34  0.32  1.00 -0.14 -0.21 -0.09  0.32;
"6 Perc rural"
   -0.06 -0.04  0.14  0.08 -0.14  1.00  0.14  0.28  0.10;
"7 Hospital beds/100K"
   -0.30 -0.17 -0.44 -0.14 -0.21  0.14  1.00 -0.20 -0.30;
"8 Gini index"
    0.65  0.65  0.71  0.68 -0.09  0.28 -0.20  1.00  0.73;
"9 Perc born in South"
    0.83  0.84  0.93  0.76  0.32  0.10 -0.30  0.73  1.00;
end;
#delimit cr

// Convert to a matrix called M
mkmat a1-a9, matrix(M) 
matrix list M

// Read in variable names
levelsof varnames, local(names)

local ednames = ""
  // Create local "ednames" that includes all variable names,
  // use function -strtoname()- to bring variable names into 
  // proper Stata format
foreach x of local names {
  *di strtoname("`x'", 1)
  local ed_name1 = strtoname("`x'", 1)
  *di "`ed_name1 '"
  local ed_name2 = substr("`ed_name1'", 4, .)
  *di "`ed_name2 '"
  local ednames = "`ednames'" + " " + "`ed_name2'"
}
In this step, we create the artificial data set from the correlation matrix, the means, and the standard deviations:
// Create data based on correlation matrix M, vectors of
// means and sd's, specify no. of cases and variable names
corr2data `ednames', n(48) clear ///
    corr(M) ///
    means(7.12 7.12 17.71 10.42 19.75 34.22 7.81 .38 7.18) ///
    sds(4.25 3.88  9.22  8.85  1.27 14.37 1.57 .02 1.44)
One shouldn't underestimate the importance of fine-looking labels in a data set:
// Create nice-looking variable labels based on the variable names
local varlabels = ""

foreach var of varlist _all {
  local varname = "`var'"
  *di "`varname'"
  local varname = subinstr("`varname'", "_"   , " ", .)
  *di "`varname'"
  local varname = subinstr("`varname'", "Perc", "%", .)
  *di "`varname'"
  lab variable `var'  "`varname'"
  local varlabels = "`varlabels'" + " " + "`varname'"
}
A brief check shows that everything seems to be in order:
corr _all



tabstat _all, statistics(mean sd)
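
The same kind of check can also be done programmatically. The following is a minimal, self-contained sketch and not part of the original do-file: it uses only the first three variables of Table 1 (with shortened variable names made up for this example), enters the correlations as a lower triangle via -cstorage(lower)-, and compares the sample correlation matrix with the target using -mreldif()-.

```stata
// Sketch only: three variables from Table 1, hypothetical short names
clear
matrix R  = (1, 0.91, 1, 0.83, 0.78, 1)   // lower triangle: r11 r21 r22 r31 r32 r33
matrix mu = (7.12, 7.12, 17.71)           // means as reported above
matrix sd = (4.25, 3.88, 9.22)            // standard deviations as reported above

corr2data homicide poverty southernness, n(48) clear ///
    corr(R) cstorage(lower) means(mu) sds(sd)

// Compare the sample correlation matrix with the full target matrix
quietly correlate homicide poverty southernness
matrix C = r(C)
matrix T = (1, 0.91, 0.83 \ 0.91, 1, 0.78 \ 0.83, 0.78, 1)
display "max. relative difference: " mreldif(C, T)

// Mean and standard deviation of the first variable
quietly summarize homicide
display "mean = " r(mean) ", sd = " r(sd)
```

A relative difference close to zero indicates that the artificial data reproduce the target correlations up to rounding error.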

However, replicating the analyses yields results that differ from the published ones:
eststo clear

// Run regression

eststo: regress Homicide_rate ///
          Structural_poverty_index Southernness_index ///
    Perc_nonwhite Perc_ages_20_34 Perc_rural ///
    Hospital_beds_100K Gini_index, beta  
    
eststo: regress Homicide_rate ///
          Structural_poverty_index ///
    Perc_nonwhite Perc_ages_20_34 Perc_rural ///
    Hospital_beds_100K Gini_index Perc_born_in_South, beta
    
estadd beta: est1 est2

esttab, r2 nonumbers mlabels("South index" "% born South") ///
        cells((b(fmt(3) star) beta(fmt(3)))) label not


These results differ substantially from those reported by Huff-Corzine and colleagues (see below). However, I have a feeling that I should rather trust my own results.
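
One way to probe which result to trust is to bypass the artificial data altogether and compute standardized coefficients directly from the published correlations with matrix algebra. The sketch below does this for a deliberately reduced model with only two predictors (structural poverty and Southernness), so its numbers are not comparable to the paper's full models. The matrix names are made up for this example. Applying the same steps to the full set of predictors should give the values in the beta column of the regression output.

```stata
// Sketch only: reduced two-predictor model, correlations taken from Table 1
// Correlations among the predictors (poverty, southernness)
matrix Rxx = (1, 0.78 \ 0.78, 1)
// Correlations of each predictor with the homicide rate
matrix rxy = (0.91 \ 0.83)

// Standardized coefficients: beta = inv(Rxx) * rxy
matrix beta = invsym(Rxx) * rxy
matrix rownames beta = poverty southernness
matrix colnames beta = beta
matrix list beta

// R-squared implied by the correlations: rxy' * beta
matrix R2 = rxy' * beta
display "R-squared = " R2[1,1]
```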

Table 2 of Huff-Corzine et al. (1986)

A robustness check offered by -corr2data- is the -seed(#)- option, which generates different artificial data sets. Specifying different seeds and comparing the results can serve as a check of whether the summary statistics are really sufficient for replicating an analysis. In our example, different seeds always lead to the same result, as expected, since OLS estimates depend on the data only through these summary statistics.
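
The seed check can be sketched as follows. This is a self-contained illustration rather than the original analysis: it again uses only three variables from Table 1 with made-up short names, and the seeds 1 and 2 are arbitrary.

```stata
// Sketch only: same summary statistics, two arbitrary seeds
matrix R  = (1, 0.91, 0.83 \ 0.91, 1, 0.78 \ 0.83, 0.78, 1)
matrix mu = (7.12, 7.12, 17.71)
matrix sd = (4.25, 3.88, 9.22)

foreach s in 1 2 {
    corr2data homicide poverty southernness, n(48) clear ///
        corr(R) means(mu) sds(sd) seed(`s')
    list in 1/3                      // the individual rows differ across seeds
    quietly regress homicide poverty southernness
    matrix b`s' = e(b)               // store the coefficient vector
}

// The coefficient vectors should agree up to rounding error
display "max. relative difference: " mreldif(b1, b2)
```

If the analysis depended on more than the means, standard deviations, and correlations, the two coefficient vectors would be expected to differ.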


Reference

Huff-Corzine, Lin, Jay Corzine, and David C. Moore. 1986. "Southern Exposure: Deciphering the South's Influence on Homicide Rates." Social Forces 64(4):906-924. doi: 10.1093/sf/64.4.906