Missing Data Handling
Prerequisites
| Prerequisite | Importance | Skill Level | Notes |
|---|---|---|---|
| | Helpful | Beginner | Provides some background on general data handling and good prerequisite coding practices |
Summary
This chapter aims to introduce missing data handling. Although we live in the age of “big data”, data can often be fragmented, incomplete, and erroneous. The methods we develop and any analysis we conduct can only be as good as the data we provide. So, is there anything we can do about a dataset riddled with missing data?
To answer this question, this chapter starts by defining the different types of Missing Data Structures and shows how to visualise them in the Visualising Missingness subchapter. This will help readers develop a strategy for choosing an appropriate missing data handling method, a few of which are outlined in Missing Data Handling Methods. Lastly, another subchapter introduces the relatively new field of Structured Missingness, pioneered by researchers in the Turing-Roche partnership. This is a large area of research, and this chapter is only a brief introduction to it. If you are interested in learning more, have a look at the Checklist and Resources page.
Motivation and Background
Missing data can disrupt research and create challenges both in interpreting results and in establishing the validity of any conclusions [PMCF+17]. Understanding missing data structures and handling missing data appropriately is important to avoid creating new biases or worsening pre-existing biases in the dataset, and to ensure fair, generalizable models [Buu18]. Both the missing data itself and how it is handled can have a large impact on subsequent analyses. Missing data handling is one step in the process of figuring out how to make the best use of your data in the most efficient way possible.
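To make the point about handling choices concrete, here is a minimal sketch in plain Python. The toy values are hypothetical and chosen only for illustration. The two strategies shown (dropping incomplete entries and filling gaps with the observed mean) are just two common options, not a recommendation from this chapter.

```python
from statistics import mean, stdev

# Hypothetical toy measurements; None marks a missing value.
values = [2.0, 4.0, None, 8.0, None, 10.0]

# How much of the data is missing?
missing_fraction = values.count(None) / len(values)

# Option 1: complete-case analysis (drop the missing entries).
observed = [v for v in values if v is not None]

# Option 2: mean imputation (fill each gap with the observed mean).
fill_value = mean(observed)
imputed = [fill_value if v is None else v for v in values]

print(f"missing fraction: {missing_fraction:.2f}")
print(f"complete-case: mean={mean(observed):.2f}, sd={stdev(observed):.2f}")
print(f"mean-imputed:  mean={mean(imputed):.2f}, sd={stdev(imputed):.2f}")
```

In this sketch both strategies give the same mean, but mean imputation shrinks the standard deviation because the filled-in values sit exactly at the centre of the data. Any later analysis that relies on the spread of the data would therefore be affected by the handling choice alone.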