In this problem set you’ll do multiple imputation of the nhanes
data with mice
and Amelia
.
if(!require("mice")) install.packages("mice")
if(!require("Amelia")) install.packages("Amelia")
head(nhanes)
## age bmi hyp chl
## 1 1 NA NA NA
## 2 2 22.7 1 187
## 3 1 NA 1 187
## 4 3 NA NA NA
## 5 1 20.4 1 113
## 6 3 NA NA 184
Your task is to try out the imputation tools and packages we discussed.
Please do the following:
a<-VIM::aggr(nhanes,plot=FALSE) #absolute per variable
summary(a) #missing patterns and their percentage per case
plot(a) #plot patterns
naniar::gg_miss_upset(nhanes) #plot overlaps
marginmatrix(nhanes) #could be sort of MAR, there's some predictiveness in the variables. But probably not super great
mice::md.pattern(nhanes)
chl
) by age
,bmi
and hyp
with listwise deletion of missing values. Note: Look at the variable types in ?nhanes
and think about how to deal with this.lwdm<-lm(chl~age+bmi+hyp,data=nhanes) #listwise deletion
summary(lwdm)
#We might already catch that age and hyp are actually categorical and we should change them; but let's leave them for a second
mice
(let’s use m=5). Plot the imputation progress and decide on the number of iterations, then set maxit
to a value that you think works. Check graphically whether imputation worked well (in the sense of imputing plausible values).imphanes <- mice::mice(nhanes, m = 5, maxit = 10) #maxit=10
plot(imphanes) #looks not great e.g., for mean of bmi (the chains should mix and no trend towards the end)
# so let's ramp up the maxit
imphanes <- mice::mice(nhanes, m = 5, maxit = 50) #maxit=50
plot(imphanes) #looks okay, but note that sd of hyp seems strange
#let's go with maxit=50 or higher
#check imputation goodness
mice::stripplot(imphanes) #looks largely okay, perhaps a tad too many 1s in hyp
nhanes4<-mice::complete(imphanes,4) #fourth imputed set
nh4<-lm(chl~age+bmi+hyp,data=nhanes4) #listwise deletion
summary(nh4)
summary(lwdm)
#quite different!
hyp
make sense? If not, change the imputation method to one that does and rerun imputation.summary(imphanes) #imputation method for all was pmm (predictive mean matching) which is for continuous variables.
# but we noted that hyp is a binary variable. So we can either change hyp in the data frame to factor(hyp) and let it automtaically be chosen or change the imputation method like this:
nhanes$hyp<-as.factor(nhanes$hyp)
nhanes$age<-as.ordered(nhanes$age) #also for age which is ordinal
imphanes <- mice::mice(nhanes, m = 5, maxit = 50, method=c("polr","pmm","logreg","pmm")) #methods are specified in the order of the variables to be imputed; we also list a method (polr) for age even though that is complete
summary(imphanes) #now hyp has been imputed with logreg
lwdm<-lm(chl~age+hyp+bmi,data=nhanes) #refit for now taking the correct variable specification
mf<-with(imphanes,lm(chl~age+hyp+bmi)) #model fitted to all imputed data sets
summary(mice::pool(mf)) #combined estimates from imputed data
summary(lwdm)
#quite different (especially the hyp coefficient is vastly different and age isn't significant from the imputed variables)! listwise doesn't work well here because we don't have MCAR and missings in the dependent variable, so imputation should be better (we alos lost around 50% of the obs in listwise deletion)
imphanes2 <- Amelia::amelia(nhanes, m = 5, ords=c("age"),noms=c("hyp")) #run amelia with ordinal specification for age (not used) and nominal specification for hyp
imphanes2
plot(imphanes2) #ok
imphanes2$imputations #imputed data sets
#fit the model to the imputed data sets
mf2 <- lapply(imphanes2$imputations, function(x) lm(chl ~ age+hyp+bmi, data=x))
mf2
#pool
summary(mice::pool(mf2))
#compare to the others
summary(mice::pool(mf))
summary(lwdm)
#seems like amelia agrees with lwd and mice in different variables
# which result is best to use? No way to check (but listwise deletion is not the right one)
# We'd now play around with the sampler, increase maxit or m and see what happens. At the end WE must make the decision, though, and own it.
Have fun!