Missing Data Handling#

Prerequisites#

| Prerequisite | Importance | Skill Level | Notes |
|---|---|---|---|
| Research Data Management | Helpful | Beginner | Provides some background on general data handling and good prerequisite coding practices |

Summary#

This chapter aims to introduce missing data handling. Although we live in the age of “big data”, data can often be fragmented, incomplete, and erroneous. The methods we develop and any analysis we conduct can only be as good as the data we provide. So, is there anything we can do about a dataset riddled with missing data?

Cartoon-like sketch with four people assembling puzzle pieces. There are four different buckets with puzzle pieces with blue dots in them. One bucket is labelled "RANDOM" and another "NOT RANDOM". Two of the people are holding a dotted puzzle piece. Another person is looking up at a hanging puzzle. Some pieces of the puzzle are dotted, others are filled with solid orange. The last person standing to the left of the hanging puzzle is painting an empty puzzle piece tile using paint from a bucket labelled "DATA HANDLING". This paint has the same dotted pattern found on the puzzle pieces in the buckets.

Fig. 78 Missing Data Handling. The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

To answer this question, this chapter starts by defining the different types of Missing Data Structures and shows how to visualise them in the Visualising Missingness subchapter. This will help readers develop a strategy for choosing an appropriate missing data handling method, a few of which are outlined in Missing Data Handling Methods. Lastly, another subchapter introduces the relatively new field of Structured Missingness, pioneered by researchers in the Turing-Roche partnership. This is a big area of research, and this chapter is only a simple introduction to it. If you are interested in learning more, have a look at the Checklist and Resources page.

Motivation and Background#

Missing data can disrupt research and create challenges in interpreting results and in establishing the validity of any conclusions [PMCF+17]. Understanding missing data structures and handling missing data appropriately is important to avoid creating new biases or worsening existing biases in the dataset, and to ensure fair, generalisable models [Buu18]. Both the missing data itself and how it is handled can have a large impact on subsequent analyses. Missing data handling is one step in the process of figuring out how to make the best use of your data in the most efficient way possible.