TY - JOUR
T1 - The effect of sampling design and response mechanism on multivariate regression-based predictors
AU - Pfeffermann, Danny
PY - 1988/9
Y1 - 1988/9
N2 - A general regression procedure for the prediction of a vector of population means in situations of nonresponse is proposed. The multivariate treatment of the prediction problem is not computationally complicated and allows the borrowing of information from one target variable to the other even when both variables are subject to nonresponse. The predictors are optimal under a general model that specifies the first and second moments of the joint distribution of the survey variables. The effects of the sampling design and response mechanism on the properties of the predictors are investigated, and appropriate modifications and bias corrections that use the sample inclusion probabilities are proposed. The performance of the predictors is illustrated empirically and compared with that of other predictors using simulated and real data. Data collected in governmental and other large-scale surveys are almost always incomplete. Typical causes for the incompleteness of the data are delays in obtaining parts of the information, refusals to answer certain questions, and exclusion of erroneous data. Another feature characterizing many surveys is that the sampling design and, in particular, the sample inclusion probabilities are determined by the values of design variables that are correlated with the survey target variables. The survey of industrial establishments in Israel is an example of the use of such a complex sampling scheme. Typical response patterns observed for this survey are presented in Table 1. Data collected in the survey were used for the empirical study described in Section 5. In this article I present a multivariate regression procedure that can be used to predict simultaneously the finite population means of p related survey variables. A notable feature of this procedure is that the imputations of the missing data for any given unit use all of the information known for that unit, including observations on variables that themselves are missing for other units. The implementation of the procedure requires the estimation of the variance-covariance matrix of the survey variables, and this can be done efficiently and in a relatively simple way by use of the EM algorithm described by Beale and Little (1975). The multivariate procedure can be modified to deal with situations where the sample inclusion probabilities and the probabilities of nonresponse depend on the measured values of design and covariate variables. The result of this dependence is that the distribution of the sample observations of the survey variables is different from the distribution in the population, which causes a bias in the unmodified predictors. The modification consists of weighting the observations in the EM algorithm by the inverse of the units’ inclusion probabilities and subtracting an estimate of the prediction bias, obtained by traditional sampling theory, from the original predictors.
KW - Imputation
KW - Informative samples
KW - Missing at random
KW - Noninformative samples
KW - Normal EM algorithm
KW - Prediction mean squared error
KW - Pξ distribution
KW - Weighted EM algorithm
UR - http://www.scopus.com/inward/record.url?scp=0042372638&partnerID=8YFLogxK
U2 - 10.1080/01621459.1988.10478670
DO - 10.1080/01621459.1988.10478670
M3 - Article
AN - SCOPUS:0042372638
SN - 0162-1459
VL - 83
SP - 824
EP - 833
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
IS - 403
ER -