Jan 2, 2013

Linear interpolation using ipolate

In some instances, it makes sense to fill in gaps in a dataset, e.g. a time series, using linear interpolation. Stata offers the -ipolate- command for this. In a "long" dataset containing the variable value, where the variable country identifies the units i and the variable year the time scale j, the command:

by country, sort: ipolate value year, generate(interpolated_value)

creates a new variable interpolated_value in which gaps are filled with linearly interpolated values. However, missing values before the first or after the last observed value of a series are not filled in, because -ipolate- treats this as extrapolation. To have these time points filled in as well, the -epolate- option needs to be specified:

by country, sort: ipolate value year, generate(interpolated_value) epolate
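
The difference between the two calls is easiest to see on a toy series. The following is a minimal sketch, not part of the original example: the data and the variable names ip and ep are made up for illustration.

// Toy data: one country, with gaps at the start, in the middle and at the end
clear
input str1 country year value
"A" 2000 .
"A" 2001 10
"A" 2002 .
"A" 2003 20
"A" 2004 .
end

// Interpolation only: fills the interior gap
by country, sort: ipolate value year, generate(ip)

// Interpolation plus extrapolation: also fills the end points
by country: ipolate value year, generate(ep) epolate

list, sepby(country)

In this sketch, ip should be 15 for 2002 and remain missing for 2000 and 2004, while ep should additionally fill the two end points by extending the line through the nearest observed values.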

A useful additional variable for sensitivity checks is a dummy indicator distinguishing between original and interpolated values:

// Create interpolation indicator (1 = original value missing and filled in)
generate byte interpolated_value_dummy = missing(value) & !missing(interpolated_value)
label define interpolated_value_dummy 0 "Original value" 1 "Interpolated value"
label val interpolated_value_dummy interpolated_value_dummy
label var interpolated_value_dummy "Value interpolation y/n"
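
To see how much of each series rests on filled-in values, the indicator can be cross-tabulated against the unit identifier. A minimal sketch, assuming the variables created above:

// Number of original and interpolated values per country
tabulate country interpolated_value_dummy

// Summary statistics restricted to original values
summarize interpolated_value if interpolated_value_dummy == 0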

A general problem with interpolated values is that when they are analyzed as if they were observed data, standard errors will be underestimated (Allison 2002), even if the linear interpolations themselves are consistent. An alternative approach, to which the same caveat applies, would be to use lowess smoothing (Cleveland 1979) via -lowess-.

References

Allison, Paul D. 2002. Missing Data. Sage.
Cleveland, William S. 1979. "Robust Locally Weighted Regression and Smoothing Scatterplots." Journal of the American Statistical Association 74(368):829-836. doi: 10.2307/2286407