The effect of sampling design and response mechanism on multivariate regression-based predictors

Danny Pfeffermann*

*Corresponding author for this work

Research output: Contribution to journal > Article > peer-review

8 Scopus citations

Abstract

A general regression procedure for the prediction of a vector of population means in situations of nonresponse is proposed. The multivariate treatment of the prediction problem is not computationally complicated and allows the borrowing of information from one target variable to the other even when both variables are subject to nonresponse. The predictors are optimal under a general model that specifies the first and second moments of the joint distribution of the survey variables. The effects of the sampling design and response mechanism on the properties of the predictors are investigated, and appropriate modifications and bias corrections that use the sample inclusion probabilities are proposed. The performance of the predictors is illustrated empirically and compared with that of other predictors using simulated and real data.

Data collected in governmental and other large-scale surveys are almost always incomplete. Typical causes of the incompleteness are delays in obtaining parts of the information, refusals to answer certain questions, and exclusion of erroneous data. Another feature characterizing many surveys is that the sampling design and, in particular, the sample inclusion probabilities are determined by the values of design variables that are correlated with the survey target variables. The survey of industrial establishments in Israel is an example of the use of such a complex sampling scheme. Typical response patterns observed for this survey are presented in Table 1. Data collected in the survey were used for the empirical study described in Section 5.

In this article I present a multivariate regression procedure that can be used to predict simultaneously the finite population means of p related survey variables. A notable feature of this procedure is that the imputations of the missing data for any given unit use all of the information known for that unit, including observations on variables that are themselves missing for other units. The implementation of the procedure requires the estimation of the variance-covariance matrix of the survey variables, and this can be done efficiently and in a relatively simple way by use of the EM algorithm described by Beale and Little (1975).

The multivariate procedure can be modified to deal with situations where the sample inclusion probabilities and the probabilities of nonresponse depend on the measured values of design and covariate variables. The result of this dependence is that the distribution of the sample observations of the survey variables differs from the distribution in the population, which causes a bias in the unmodified predictors. The modification consists of two steps: weighting the observations in the EM algorithm by the inverse of the units’ inclusion probabilities, and subtracting from the original predictors an estimate of the prediction bias obtained by traditional sampling theory.
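The weighting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the article's algorithm: it assumes a multivariate normal working model, marks missing entries as NaN, and uses a hypothetical function name (`weighted_em_mvn`) and argument names chosen only for this example. The bias-correction step is not shown.

```python
import numpy as np

def weighted_em_mvn(Y, w, n_iter=100):
    """Weighted EM estimates of the mean and covariance of a multivariate
    normal when some entries of Y are missing (NaN).
    Y: (n, p) array; w: one positive weight per row (assumed here to be
    the inverse of the unit's inclusion probability)."""
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(w, dtype=float)
    n, p = Y.shape
    miss = np.isnan(Y)

    # Starting values: weighted available-case means and variances.
    mu = np.empty(p)
    var = np.empty(p)
    for j in range(p):
        obs = ~miss[:, j]
        mu[j] = np.average(Y[obs, j], weights=w[obs])
        var[j] = np.average((Y[obs, j] - mu[j]) ** 2, weights=w[obs])
    sigma = np.diag(var)

    for _ in range(n_iter):
        # E-step: impute each missing entry by its conditional mean given
        # everything observed for that unit, and accumulate the weighted
        # conditional covariance of the imputed entries.
        filled = np.where(miss, 0.0, Y)
        C = np.zeros((p, p))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if o.any():
                B = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                filled[i, m] = mu[m] + B @ (Y[i, o] - mu[o])
                cond = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
            else:
                filled[i, m] = mu[m]
                cond = sigma[np.ix_(m, m)]
            C[np.ix_(m, m)] += w[i] * cond

        # M-step: weighted mean and weighted covariance of the completed data.
        mu = np.average(filled, axis=0, weights=w)
        R = filled - mu
        sigma = ((w[:, None] * R).T @ R + C) / w.sum()

    return mu, sigma, filled
```

With all weights equal, this reduces to an ordinary (unweighted) normal EM; passing `w = 1 / pi`, where `pi` holds the inclusion probabilities, gives units with a small chance of selection proportionally more influence on the estimated moments. The returned `filled` array contains the conditional-mean imputations from the final E-step.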

Original language: English
Pages (from-to): 824-833
Number of pages: 10
Journal: Journal of the American Statistical Association
Volume: 83
Issue number: 403
State: Published - Sep 1988

Keywords

  • Imputation
  • Informative samples
  • Missing at random
  • Noninformative samples
  • Normal EM algorithm
  • Prediction mean squared error
  • Pξ distribution
  • Weighted EM algorithm
