Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data
Authors: Shelmith Nyagathiri Kariuki, Anthony Waititu Gichuhi, and Anthony Kibira Wanjoya
Source: American Journal of Theoretical and Applied Statistics, 4(3): 192-200; doi: 10.11648/j.ajtas.20150403.26
Topic(s): Data quality
Country: Africa
  Kenya
Published: MAY 2015
Abstract: Missing data poses a major threat to observational and experimental studies. Analysing data while ignoring missingness yields estimates that are inefficient and biased. Various studies have sought to determine the best methods of dealing with missing data. The analysis in these studies involved simulating missing data from complete data. Missing data are then imputed using the various methods, and the best method is identified by looking at the bias of the imputed estimates relative to the complete-data estimates and at the magnitude of the standard errors. This study aimed at establishing the best method of dealing with missing data, based on goodness-of-fit tests. The study made use of data from KDHS 2010. The overall rate of missingness was about 80%. The missing data mechanism was tested and found to be MAR (missing at random). The missing data were then imputed using the Expectation Maximization algorithm and Multiple Imputation. Logistic models were then fitted to both imputed datasets, and goodness-of-fit tests were carried out to determine which of the two methods was better for imputing data. These tests were the AIC, Root Mean Square Error of Approximation (RMSEA) and Cox and Snell’s R-squared. The predictive ability of the two models was also examined using confusion matrices and the area under the receiver operating characteristic curve (AUROC). From these tests, multiple imputation was seen to be the better method of imputation, since the logistic regression model fitted the data imputed by it better than the data imputed using expectation maximization. From the results of the study, the researchers recommend that the type of missingness present in data should be examined. If the amount of missing data is large, and the data are MAR, then data should be imputed using multiple imputation before any inferences are made. The researchers suggested that more research be done to determine the maximum rate of missing data that should be imputed.
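The comparison criteria named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: it assumes that a logistic model has already been fitted to each imputed dataset and that its fitted probabilities are available. The function names are hypothetical, RMSEA is omitted, and the sketch uses only the standard definitions of AIC, Cox and Snell's R-squared, the confusion matrix, and AUROC.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 labels y under fitted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def aic(y, p, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood(y, p)

def cox_snell_r2(y, p):
    """Cox and Snell R-squared: 1 - (L_null / L_model)^(2/n)."""
    n = len(y)
    base_rate = sum(y) / n  # fitted probability of the intercept-only model
    ll_null = log_likelihood(y, [base_rate] * n)
    ll_model = log_likelihood(y, p)
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def confusion_matrix(y, p, threshold=0.5):
    """Counts of true/false positives/negatives at a probability cut-off."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for yi, pi in zip(y, p):
        predicted = 1 if pi >= threshold else 0
        if predicted == 1:
            counts["tp" if yi == 1 else "fp"] += 1
        else:
            counts["tn" if yi == 0 else "fn"] += 1
    return counts

def auroc(y, p):
    """Area under the ROC curve as the share of (positive, negative) pairs
    ranked correctly, with ties counted as one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

Under these assumptions, the imputation method whose logistic model shows the lower AIC, the higher Cox and Snell R-squared, and the higher AUROC would be judged the better fit, which mirrors the kind of comparison the abstract describes.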