Studies can be designed to be robust to missing data, such that any missingness has only a limited effect on the results and introduces little bias. Frequently, however, this is not possible, so this sub-chapter provides an overview of some methods for handling missing data. Discussion of these methods can quickly get very technical; we will do our best to avoid jargon and leave the technical explanations to the great existing resources linked in Checklist and Resources.

Deletion

The simplest way of handling missing data is to remove any rows or data entries that have missing values in any variable of interest. This is known as listwise deletion or complete-case analysis (Oluwaseye Joel et al., 2022). However, this may greatly reduce the data available for analysis, introduce or increase bias, and decrease representation (Woods et al., 2024). Nevertheless, it is an appropriate option for data that are missing completely at random (MCAR).
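As a minimal sketch of listwise deletion, the snippet below drops every row that has a missing value in any variable of interest. The records, the variable names and the `listwise_delete` helper are hypothetical and only serve as illustration; with tabular data in pandas, `DataFrame.dropna()` performs the same kind of filtering.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"id": 1, "age": 71, "cognition": 27},
    {"id": 2, "age": None, "cognition": 25},
    {"id": 3, "age": 68, "cognition": None},
    {"id": 4, "age": 80, "cognition": 22},
]


def listwise_delete(rows, variables):
    """Keep only the rows with no missing value in any variable of interest."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


complete_cases = listwise_delete(records, ["age", "cognition"])
print(f"kept {len(complete_cases)} of {len(records)} rows")  # kept 2 of 4 rows
```

Even in this tiny example half of the rows are lost, which illustrates how quickly listwise deletion can shrink a dataset.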

Pairwise deletion, or available-case analysis, removes data entries only as the need arises: each analysis uses every entry that is complete for the variables it involves (Pigott, 2001). This method uses more of your data, but because different analyses draw on different subsets of entries, results can be inconsistent with one another, and one or more variables can end up being explained by others (via a linear combination).
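The sketch below illustrates the idea of pairwise deletion under simple assumptions: for each pair of variables, it keeps the rows that are complete for that pair only. The records, variable names and the `available_cases` helper are hypothetical. In pandas, `DataFrame.corr()` excludes missing values in a comparable pairwise fashion.

```python
from itertools import combinations

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 71, "cognition": 27, "education": 12},
    {"age": None, "cognition": 25, "education": 16},
    {"age": 68, "cognition": None, "education": 10},
    {"age": 80, "cognition": 22, "education": 9},
    {"age": 75, "cognition": 24, "education": 14},
]
variables = ["age", "cognition", "education"]


def available_cases(rows, pair):
    """Rows that are complete for the two variables in `pair` only."""
    return [row for row in rows if all(row[v] is not None for v in pair)]


# Number of usable rows for each pairwise analysis.
pair_counts = {
    pair: len(available_cases(records, pair))
    for pair in combinations(variables, 2)
}
for pair, n in pair_counts.items():
    print(pair, "uses", n, "rows")
```

Here the age and cognition pair uses 3 rows while the other two pairs use 4, so each pairwise statistic is computed on a different subset of participants. This is where the lack of consistency between results comes from.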

Imputation

Imputation refers to “filling in” any missing values. Many ways of imputing missing data exist:

Flowchart about Missing Data Handling Techniques, adapted from Joel and others, 2022. The diagram presents three main strategies: Deletion, Imputation, and Built-in. Under Deletion, two methods are listed: Listwise Deletion and Pairwise Deletion. The Imputation branch splits into five subtypes: Single Imputation (for example, mean, median, mode, LOCF-Last Observation Carried Forward, NOCB-Next Observation Carried Backward), Multiple Imputation, Model-Based Imputation (for example, regression, k-nearest neighbor, hot-deck, maximum likelihood, expectation maximization), Machine Learning Imputation (for example, artificial neural networks, deep learning), and Optimization Algorithm Imputation (for example, genetic algorithm, particle swarm optimization, matrix completion). The Built-in branch includes Decision Tree Methods.

Figure 1: Diagram of missing data handling techniques, created by Joel et al. (Oluwaseye Joel et al., 2022). Abbreviations: LOCF - Last Observation Carried Forward, NOCB - Next Observation Carried Backward. Used under a CC-BY 4.0 licence. DOI: 10.15157/IJITIS.2022.5.3.971-1005.

Single imputation involves imputing a single value per missing value. These methods are easy to implement and suitable when there are a small number of missing values in a large dataset.
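To make this concrete, here is a minimal sketch of three single imputation methods named in Figure 1: mean imputation, LOCF and NOCB. The `scores` list and the function names are hypothetical, and `None` is assumed to mark a missing value. With pandas, `fillna`, `ffill` and `bfill` offer comparable behaviour.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_locf(values):
    """Last Observation Carried Forward: reuse the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing has been observed yet
    return filled


def impute_nocb(values):
    """Next Observation Carried Backward: LOCF applied to the reversed series."""
    return impute_locf(values[::-1])[::-1]


# Hypothetical repeated cognition scores for one participant.
scores = [28, None, 26, None, None, 23]
print(impute_locf(scores))  # [28, 28, 26, 26, 26, 23]
print(impute_nocb(scores))  # [28, 26, 26, 23, 23, 23]
print(impute_mean(scores))
```

Note that LOCF and NOCB only make sense for ordered data such as repeated measurements, and that each method gives a different completed series for the same input.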

In contrast, multiple imputation methods can generate several values for each missing value. This means that multiple imputation methods can also have an average value and variance assigned to each imputed value, thereby giving the option to test the stability of downstream analyses. One popular multiple imputation method is Multivariate Imputation by Chained Equations (MICE), which is explained below.
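As a rough illustration, suppose a multiple imputation method has already produced five candidate values for each of two missing entries. All numbers and names below are made up for the example. The sketch summarises each missing entry by its mean and variance, then repeats a simple downstream analysis (the sample mean) on each completed dataset to see how stable it is.

```python
from statistics import mean, variance

# Hypothetical observed cognition scores.
observed = [27, 25, 22]

# Hypothetical output of a multiple imputation method:
# five imputed values for each of two missing entries.
imputations = {
    "participant_4": [24, 26, 23, 25, 27],
    "participant_7": [20, 19, 22, 21, 18],
}

# Average and (sample) variance of the imputed values for each missing entry.
summary = {
    name: {"mean": mean(values), "variance": variance(values)}
    for name, values in imputations.items()
}
print(summary)

# Repeat the downstream analysis once per completed dataset.
n_imputations = 5
estimates = [
    mean(observed + [values[i] for values in imputations.values()])
    for i in range(n_imputations)
]
print(estimates)
```

If the estimates barely change between completed datasets, the downstream analysis is not very sensitive to the uncertainty in the imputed values.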

The more complicated imputation methods, such as those under the model-based, machine learning and optimization algorithm imputation headings, tend to perform better when there are many missing values and can handle many types of missing data. However, they are harder to implement and usually require very large sample sizes to produce consistent results.

Great care should be taken with imputation methods, as they may introduce or further amplify existing bias. For instance, in our sample dataset, those with the most severe cognitive decline were unable to attend the follow-up visit to get their cognition assessed. If their values were imputed using everyone else’s, the imputed values would considerably underestimate their true cognitive decline.
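The small simulation below sketches this problem under invented assumptions: cognition scores are drawn from a normal distribution, the lowest-scoring 20% of participants miss the follow-up visit, and their scores are filled in with the mean of everyone else. None of these numbers come from a real dataset.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical "true" follow-up cognition scores for 200 participants.
true_scores = [random.gauss(24, 3) for _ in range(200)]

# Assumption: the 40 participants with the lowest scores do not attend.
cutoff = sorted(true_scores)[40]
observed = [s if s >= cutoff else None for s in true_scores]

# Mean imputation using only the participants who attended.
fill = mean([s for s in observed if s is not None])
imputed = [fill if s is None else s for s in observed]

true_missing_mean = mean(
    [t for t, o in zip(true_scores, observed) if o is None]
)
print(f"imputed value for non-attenders:  {fill:.1f}")
print(f"true mean score of non-attenders: {true_missing_mean:.1f}")
print(f"mean after imputation: {mean(imputed):.1f}, true mean: {mean(true_scores):.1f}")
```

Because the imputed score is taken from healthier participants, it overstates the cognition of those who dropped out, which means their decline is understated and the overall mean is biased upwards.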

MICE

MICE works iteratively to impute data. It uses the most complete data to inform the values of increasingly less complete entries. This is done iteratively in cycles, such that at the end of each cycle there is one set of imputed values (van Buuren & Groothuis-Oudshoorn, 2011). A slightly more detailed explanation follows below.

The first step of a cycle involves filling in the missing values of all variables using a simple method, for example the mean of the observed values. Then, the filled-in values of the variable with the fewest missing values are set back to missing. The observed values of that variable are regressed on the other variables in the dataset. Imputed values are then estimated using this regression model, and the missing values are replaced (Azur et al., 2011). A different type of regression can be used for each variable, such that each variable is handled separately and can be assigned its own distribution. The imputation also includes some randomness to capture the uncertainty of the imputed values (Wulff & Jeppesen, 2017). Next, the filled-in values of the variable with the second fewest missing values are set back to missing and imputed as described previously, and so on. This is repeated iteratively, even after all missing values have been imputed. At the end of a set number of iterations over all variables (for example, 50 iterations), the cycle is complete and each missing value has an estimated imputed value.
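The steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm as implemented in the `mice` R package: it assumes exactly two numeric variables, uses simple linear regression for both, and only adds residual noise to the predictions. The data and the names `fit_line` and `mice_cycle` are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual standard deviation."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / max(len(xs) - 2, 1)) ** 0.5
    return intercept, slope, sd


def mice_cycle(data, n_iter=10, rng=None):
    """One cycle of a two-variable chained-equations sketch.

    data maps a variable name to a list of values, with None for missing.
    Returns one completed copy of the data; the input is left untouched.
    """
    rng = rng or random.Random()
    missing = {v: [i for i, x in enumerate(col) if x is None]
               for v, col in data.items()}
    # Step 1: fill every missing value with a simple placeholder (the mean).
    filled = {}
    for v, col in data.items():
        placeholder = mean([x for x in col if x is not None])
        filled[v] = [placeholder if x is None else x for x in col]
    # Visit variables from the fewest to the most missing values.
    order = sorted(data, key=lambda v: len(missing[v]))
    for _ in range(n_iter):
        for target in order:
            if not missing[target]:
                continue
            # Assumption of this sketch: exactly one other variable.
            (predictor,) = [v for v in data if v != target]
            rows = [i for i, x in enumerate(data[target]) if x is not None]
            # Regress the observed target values on the (filled) predictor.
            a, b, sd = fit_line([filled[predictor][i] for i in rows],
                                [filled[target][i] for i in rows])
            # Replace the missing values with a prediction plus random noise.
            for i in missing[target]:
                filled[target][i] = a + b * filled[predictor][i] + rng.gauss(0, sd)
    return filled


# Hypothetical data: age has one missing value, score has two.
data = {
    "age": [65, 70, None, 80, 85, 90, 72, 78],
    "score": [29, 27, 26, None, 22, None, 27, 24],
}
# Five cycles give five imputed values for each missing entry.
completed_sets = [mice_cycle(data, rng=random.Random(seed)) for seed in range(5)]
print([round(c["score"][3], 1) for c in completed_sets])
```

In practice, established implementations handle more variables, different regression types per variable and the uncertainty of the regression coefficients themselves, so a sketch like this is best used for building intuition only.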

Often, MICE is run repeatedly, so that an average value and variance are obtained for each imputed value. For example, 5 cycles may be completed, giving 5 imputed values per missing data point. Simulations can also be used to estimate the performance of MICE on a given dataset and to determine whether values are imputed within a certain tolerance. Another way of evaluating the imputation is to compare the distributions of the observed and the imputed data. If the distributions are quite similar, the imputed values are plausible.
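A simple way to start comparing distributions is to put summary statistics of the observed and imputed values side by side, as in the sketch below. The values and the `describe` helper are hypothetical; in practice, density plots or histograms are often used for the same purpose.

```python
from statistics import mean, stdev

# Hypothetical observed and imputed cognition scores.
observed_scores = [29, 27, 26, 22, 27, 24, 25, 28]
imputed_scores = [25.1, 26.4, 23.8, 27.0]


def describe(values):
    """Basic summary statistics for a quick distribution check."""
    return {
        "n": len(values),
        "mean": round(mean(values), 2),
        "sd": round(stdev(values), 2),
        "min": min(values),
        "max": max(values),
    }


print("observed:", describe(observed_scores))
print("imputed: ", describe(imputed_scores))
```

Large differences are a prompt to look more closely rather than proof of a problem, since the entries with missing values may genuinely differ from the rest.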

Summary

This is a vast area of research with many proposed methods; if you are interested in finding out more, please see Checklist and Resources. Hopefully, this sub-chapter acts as a useful starting point!

References
  1. Oluwaseye Joel, L., Doorsamy, W., & Sena Paul, B. (2022). A Review of Missing Data Handling Techniques for Machine Learning. International Journal of Innovative Technology and Interdisciplinary Sciences, 5(3), 971–1005. https://doi.org/10.15157/IJITIS.2022.5.3.971-1005
  2. Woods, A. D., Gerasimova, D., Van Dusen, B., Nissen, J., Bainter, S., Uzdavines, A., Davis-Kean, P. E., Halvorson, M., King, K. M., Logan, J. A. R., Xu, M., Vasilev, M. R., Clay, J. M., Moreau, D., Joyal-Desmarais, K., Cruz, R. A., Brown, D. M. Y., Schmidt, K., & Elsherif, M. M. (2024). Best practices for addressing missing data through multiple imputation. Infant and Child Development, 33(1), e2407. https://doi.org/10.1002/icd.2407
  3. Pigott, T. D. (2001). A Review of Methods for Missing Data. Educational Research and Evaluation, 7(4), 353–383. https://doi.org/10.1076/edre.7.4.353.8937
  4. van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
  5. Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49. https://doi.org/10.1002/mpr.329
  6. Wulff, J. N., & Jeppesen, L. E. (2017). Multiple imputation by chained equations in praxis: Guidelines and review. Electronic Journal of Business Research Methods, 15(1), 41–56.
  7. White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067