Comparison of Methods of Handling Missing Data: A Case Study of KDHS 2010 Data |
Authors: |
Shelmith Nyagathiri Kariuki, Anthony Waititu Gichuhi, and Anthony Kibira Wanjoya |
Source: |
American Journal of Theoretical and Applied Statistics, 4(3): 192-200; doi: 10.11648/j.ajtas.20150403.26 |
Topic(s): |
Data quality
|
Country: |
Africa
Kenya
|
Published: |
May 2015 |
Abstract: |
Missing data pose a major threat to observational and experimental studies. Analysing data while
ignoring missingness results in estimates that are inefficient and biased. Various studies have
been done to determine the best methods of dealing with missing data. The analysis used in these
studies involved simulating missing data from complete data. The missing data are then imputed
using the various methods, and the best method is identified by looking at the bias of the imputed
estimates relative to the complete-data estimates, and at the magnitude of the standard errors.
This study aimed at establishing the best method of dealing with missing data, based on goodness
of fit tests. The study made use of data from KDHS 2010. The overall rate of missingness was about
80%. The missing data mechanism was tested and found to be missing at random (MAR). The missing
data were then imputed using the Expectation Maximization algorithm and Multiple Imputation.
Logistic models were then fitted to both datasets. Afterwards, goodness of fit tests were carried
out to determine which of the two methods was the better method for imputing data. These tests
were the AIC, the Root Mean Square Error of Approximation (RMSEA) and Cox and Snell’s R-Squared.
The predictive ability of the two models was also examined using confusion matrices and the area
under the receiver operating characteristic curve (AUROC). From these tests, multiple imputation
was seen to be the better method of imputation, since the logistic regression model fitted the
multiply imputed data better than the data imputed using expectation maximization. From the
results of the study, the researchers recommend that the type of missingness present in data
should be examined. If the amount of missing data is large and the data are MAR, then the data
should be imputed using multiple imputation before any inferences are made. The researchers
suggested that more research be done to determine the maximum rate of missing data that should be
imputed. |
|
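The comparison criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' code and uses no KDHS data: the outcomes `y`, the predicted probabilities `p_a` and `p_b`, and the parameter count `k` are hypothetical toy values standing in for two logistic models fitted to differently imputed datasets. RMSEA is omitted, and the AUROC is computed by simple pairwise comparison rather than by any particular library routine.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi) for yi, pi in zip(y, p))

def aic(ll, k):
    """Akaike information criterion for a model with k parameters (lower is better)."""
    return 2 * k - 2 * ll

def cox_snell_r2(ll_null, ll_model, n):
    """Cox and Snell's R-squared: 1 - (L_null / L_model)^(2/n)."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def auroc(y, p):
    """Probability that a random positive case is scored above a random negative one."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def confusion_matrix(y, p, threshold=0.5):
    """Counts of true/false positives/negatives when predicting 1 if p >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for yi, pi in zip(y, p):
        predicted = 1 if pi >= threshold else 0
        if predicted == 1:
            counts["tp" if yi == 1 else "fp"] += 1
        else:
            counts["tn" if yi == 0 else "fn"] += 1
    return counts

def summarise(y, p, k):
    """Collect the fit and prediction measures for one model's predicted probabilities."""
    n = len(y)
    base_rate = sum(y) / n  # intercept-only (null) model predicts the base rate
    ll_null = log_likelihood(y, [base_rate] * n)
    ll_model = log_likelihood(y, p)
    return {
        "aic": aic(ll_model, k),
        "cox_snell_r2": cox_snell_r2(ll_null, ll_model, n),
        "auroc": auroc(y, p),
        "confusion": confusion_matrix(y, p),
    }

# Hypothetical toy data: observed outcomes and two sets of fitted probabilities.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_a = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.6, 0.4]
p_b = [0.7, 0.4, 0.6, 0.4, 0.5, 0.3, 0.6, 0.6]
k = 3  # assumed number of estimated coefficients, identical for both models

if __name__ == "__main__":
    for name, p in (("model A", p_a), ("model B", p_b)):
        print(name, summarise(y, p, k))
```

Under these criteria, the model with the lower AIC and the higher Cox and Snell R-squared and AUROC would be judged the better fit, which is the kind of comparison the abstract describes between the two imputation methods.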