Aug 30, 2012

-merge- crib sheet

Since Stata 11, -merge- comes with a more precise syntax, distinguishing between different types of matching. The general command structure is as follows:
use master
merge type idvars using using, options 

The so-called "master" file (the data in memory) is matched with the "using" file (the file named after the keyword -using-) based on the list of ID variable(s) "idvars". "Type" distinguishes between four types of merging:
  • One-to-one:
    When using -merge 1:1-, Stata merges one observation in the "master" file to the corresponding observation in the "using" file.
    (Observations that cannot be matched are nonetheless kept in the merged file. To drop them, add the option -keep(match)-; to make Stata stop with an error whenever an observation is left unmatched, add -assert(match)-. If the IDs are not unique, -merge 1:1- produces an error message.)
  • Many-to-one:
    When using -merge m:1-, Stata merges many observations from the "master" data set to one corresponding observation in the "using" file. An example would be individual-level data in the "master" file and country- or household-level information in the "using" file.
  • One-to-many:
    -merge 1:m- is just the reverse of many-to-one; e.g. country-level information in the "master" file is matched to respondent-level information in the "using" file.
  • Many-to-many:
    According to the Stata Data-Management Reference Manual [D], -merge m:m- "is allowed for completeness, but it is difficult to imagine an example of when it would be useful. Use of -merge m:m- is not encouraged."
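
The many-to-one case described above can be sketched as follows. This is a minimal illustration, not a recipe: the variable names (id, country, gdp), the values, and the tempfile names are all made up for the example.

```stata
* Hypothetical individual-level data: three respondents in two countries
clear
input id str2 country
1 "DE"
2 "DE"
3 "FR"
end
tempfile persons
save `persons'

* Hypothetical country-level data; "IT" has no respondents
clear
input str2 country gdp
"DE" 100
"FR" 90
"IT" 80
end
tempfile countries
save `countries'

* Master = individuals, using = countries: many-to-one on country
use `persons', clear
merge m:1 country using `countries'
tabulate _merge               // matched respondents plus the using-only "IT" row
assert _N == 4

* assert(match) turns any unmatched observation into an error
use `persons', clear
capture noisily merge m:1 country using `countries', assert(match)
display _rc                   // nonzero here, because "IT" is unmatched

* keep(match) silently drops unmatched observations instead
use `persons', clear
merge m:1 country using `countries', keep(match) nogenerate
assert _N == 3
```

Swapping the roles of the two files (country data in memory, respondents as the "using" file) would turn the same match into -merge 1:m-.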

 

Troubleshooting merges

Still have code written for the old -merge- command? See the old help file here.

William Gould has the following suggestions for merges gone bad:
  • Check whether the ID variable is stored properly. A float holds integers exactly only up to 16,777,216, so a longer ID number stored as a float gets rounded and distinct IDs can collapse into the same value. Store long numeric IDs as long or double instead.
  • Check the uniqueness of IDs in both files to be merged:
    by id, sort: assert _N == 1
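  • Merge on all common variables: If you have doubts about your ID variable, add another variable that should be constant within units, for instance gender, to the list of ID variables. Units whose values disagree between the two files then show up as unmatched observations.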
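
The three checks above might look like this in a do-file. This is only a sketch under assumed data: the ID values, the variable names (id, id_f, gender, income), and the tempfiles are hypothetical and chosen to make each problem visible.

```stata
* 1. Storage type: 9-digit IDs survive as double but not as float
clear
set obs 2
generate double id = 123456789 + _n
generate float id_f = id          // rounded to the nearest representable float
assert id_f != id                 // every value was changed by the rounding

* 2. Uniqueness: the double IDs are unique, the float copies are not
isid id
by id, sort: assert _N == 1
capture isid id_f
assert _rc != 0                   // both IDs collapsed into the same float

* 3. Merge on all common variables: unit 2 has conflicting gender codes
clear
input id gender
1 0
2 1
end
tempfile a
save `a'

clear
input id gender income
1 0 100
2 0 200
end
tempfile b
save `b'

use `a', clear
merge 1:1 id gender using `b'
tabulate _merge                   // unit 2 appears twice, unmatched both times
count if _merge == 3
assert r(N) == 1                  // only unit 1 matched on id and gender
```

-isid- is used here as a shorthand for the -by ... assert _N == 1- check; both stop with an error when the ID does not uniquely identify the observations.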