In this problem set you’ll do multiple imputation of the nhanes data with mice and Amelia.

if(!require("mice")) install.packages("mice")
if(!require("Amelia")) install.packages("Amelia")
head(nhanes)
##   age  bmi hyp chl
## 1   1   NA  NA  NA
## 2   2 22.7   1 187
## 3   1   NA   1 187
## 4   3   NA  NA  NA
## 5   1 20.4   1 113
## 6   3   NA  NA 184

Your task is to try out the imputation tools and packages we discussed.

Please do the following:

a<-VIM::aggr(nhanes,plot=FALSE) #absolute per variable
summary(a) #missing patterns and their percentage per case
plot(a) #plot patterns
naniar::gg_miss_upset(nhanes) #plot overlaps
marginmatrix(nhanes) #could be sort of MAR, there's some predictiveness in the variables. But probably not super great
mice::md.pattern(nhanes) 
lwdm<-lm(chl~age+bmi+hyp,data=nhanes) #listwise deletion
summary(lwdm)
#We might already catch that age and hyp are actually categorical and we should change them; but let's leave them for a second  
imphanes <- mice::mice(nhanes, m = 5, maxit = 10) #maxit=10
plot(imphanes) #looks not great e.g., for mean of bmi (the chains should mix and no trend towards the end)
# so let's ramp up the maxit
imphanes <- mice::mice(nhanes, m = 5, maxit = 50) #maxit=50
plot(imphanes) #looks okay, but note that sd of hyp seems strange
#let's go with maxit=50 or higher
#check imputation goodness
mice::stripplot(imphanes) #looks largely okay, perhaps a tad too many 1s in hyp
nhanes4<-mice::complete(imphanes,4) #fourth imputed set
nh4<-lm(chl~age+bmi+hyp,data=nhanes4) #listwise deletion
summary(nh4)
summary(lwdm)
#quite different!
summary(imphanes) #imputation method for all was pmm (predictive mean matching) which is for continuous variables.
# but we noted that hyp is a binary variable. So we can either change hyp in the data frame to factor(hyp) and let it automtaically be chosen or change the imputation method like this:
nhanes$hyp<-as.factor(nhanes$hyp)
nhanes$age<-as.ordered(nhanes$age) #also for age which is ordinal
imphanes <- mice::mice(nhanes, m = 5, maxit = 50, method=c("polr","pmm","logreg","pmm")) #methods are specified in the order of the variables to be imputed; we also list a method (polr) for age even though that is complete 
summary(imphanes) #now hyp has been imputed with logreg 
lwdm<-lm(chl~age+hyp+bmi,data=nhanes) #refit for now taking the correct variable specification
mf<-with(imphanes,lm(chl~age+hyp+bmi)) #model fitted to all imputed data sets
summary(mice::pool(mf)) #combined estimates from imputed data
summary(lwdm)
#quite different (especially the hyp coefficient is vastly different and age isn't significant from the imputed variables)! listwise doesn't work well here because we don't have MCAR and missings in the dependent variable, so imputation should be better (we alos lost around 50% of the obs in listwise deletion)
imphanes2 <- Amelia::amelia(nhanes, m = 5, ords=c("age"),noms=c("hyp")) #run amelia with ordinal specification for age (not used) and nominal specification for hyp
imphanes2 
plot(imphanes2) #ok
imphanes2$imputations #imputed data sets
#fit the model to the imputed data sets
mf2 <- lapply(imphanes2$imputations, function(x) lm(chl ~ age+hyp+bmi, data=x))
mf2
#pool
summary(mice::pool(mf2))
#compare to the others
summary(mice::pool(mf))
summary(lwdm)
#seems like amelia agrees with lwd and mice in different variables
# which result is best to use? No way to check (but listwise deletion is not the right one) 
# We'd now play around with the sampler, increase maxit or m and see what happens. At the end WE must make the decision, though, and own it.

Have fun!