Dealing with Missing Data - Practical II: Multiple Imputation

In this problem set you’ll do multiple imputation of the nhanes data with mice and Amelia.

if(!require("mice")) install.packages("mice")
if(!require("Amelia")) install.packages("Amelia")

head(nhanes)

##   age  bmi hyp chl
## 1   1   NA  NA  NA
## 2   2 22.7   1 187
## 3   1   NA   1 187
## 4   3   NA  NA  NA
## 5   1 20.4   1 113
## 6   3   NA  NA 184

Your task is to try out the imputation tools and packages we discussed.

Please do the following:

Diagnose the missing patterns. Check whether the MCAR assumption is viable or whether a MAR assumption might work for the data.

a<-VIM::aggr(nhanes,plot=FALSE) #absolute per variable
summary(a) #missing patterns and their percentage per case
plot(a) #plot patterns
naniar::gg_miss_upset(nhanes) #plot overlaps
marginmatrix(nhanes) #could be sort of MAR, there's some predictiveness in the variables. But probably not super great
mice::md.pattern(nhanes)

Fit the linear analysis model that explains cholesterol (chl) by age,bmi and hyp with listwise deletion of missing values. Note: Look at the variable types in ?nhanes and think about how to deal with this.

lwdm<-lm(chl~age+bmi+hyp,data=nhanes) #listwise deletion
summary(lwdm)
#We might already catch that age and hyp are actually categorical and we should change them; but let's leave them for a second

Impute missing values with mice (let’s use m=5). Plot the imputation progress and decide on the number of iterations, then set maxit to a value that you think works. Check graphically whether imputation worked well (in the sense of imputing plausible values).

imphanes <- mice::mice(nhanes, m = 5, maxit = 10) #maxit=10
plot(imphanes) #looks not great e.g., for mean of bmi (the chains should mix and no trend towards the end)
# so let's ramp up the maxit
imphanes <- mice::mice(nhanes, m = 5, maxit = 50) #maxit=50
plot(imphanes) #looks okay, but note that sd of hyp seems strange
#let's go with maxit=50 or higher
#check imputation goodness
mice::stripplot(imphanes) #looks largely okay, perhaps a tad too many 1s in hyp

Extract the imputed data sets and fit the analysis model to the 4th imputed data set. Compare the results with the results from listwise deletion.

nhanes4<-mice::complete(imphanes,4) #fourth imputed set
nh4<-lm(chl~age+bmi+hyp,data=nhanes4) #listwise deletion
summary(nh4)
summary(lwdm)
#quite different!

Check which imputation method has been used. Does the imputation method for hyp make sense? If not, change the imputation method to one that does and rerun imputation.

summary(imphanes) #imputation method for all was pmm (predictive mean matching) which is for continuous variables.
# but we noted that hyp is a binary variable. So we can either change hyp in the data frame to factor(hyp) and let it automtaically be chosen or change the imputation method like this:
nhanes$hyp<-as.factor(nhanes$hyp)
nhanes$age<-as.ordered(nhanes$age) #also for age which is ordinal
imphanes <- mice::mice(nhanes, m = 5, maxit = 50, method=c("polr","pmm","logreg","pmm")) #methods are specified in the order of the variables to be imputed; we also list a method (polr) for age even though that is complete 
summary(imphanes) #now hyp has been imputed with logreg

Once you’re satisfied with the imputation, fit the analysis model to all imputed data sets and pool the results. Compare the pooled result to the results from listwise deletion. What do you notice?

lwdm<-lm(chl~age+hyp+bmi,data=nhanes) #refit for now taking the correct variable specification
mf<-with(imphanes,lm(chl~age+hyp+bmi)) #model fitted to all imputed data sets
summary(mice::pool(mf)) #combined estimates from imputed data
summary(lwdm)
#quite different (especially the hyp coefficient is vastly different and age isn't significant from the imputed variables)! listwise doesn't work well here because we don't have MCAR and missings in the dependent variable, so imputation should be better (we alos lost around 50% of the obs in listwise deletion)

Impute the data once again but this time with Amelia. Fit the models and pool the results. Compare the results with the pooled result from mice and with the result from listwise deletion.

imphanes2 <- Amelia::amelia(nhanes, m = 5, ords=c("age"),noms=c("hyp")) #run amelia with ordinal specification for age (not used) and nominal specification for hyp
imphanes2 
plot(imphanes2) #ok
imphanes2$imputations #imputed data sets
#fit the model to the imputed data sets
mf2 <- lapply(imphanes2$imputations, function(x) lm(chl ~ age+hyp+bmi, data=x))
mf2
#pool
summary(mice::pool(mf2))
#compare to the others
summary(mice::pool(mf))
summary(lwdm)
#seems like amelia agrees with lwd and mice in different variables
# which result is best to use? No way to check (but listwise deletion is not the right one) 
# We'd now play around with the sampler, increase maxit or m and see what happens. At the end WE must make the decision, though, and own it.

Have fun!