Structured Missingness

An alternative way of characterising missing data, known as structured missingness (SM), has been pioneered by researchers of the Turing-Roche Partnership. SM can arise in data that is MCAR, MAR or MNAR; what defines it is that the missingness itself has some structure or pattern [MMC+23]. Specifically, standard definitions of missingness mechanisms (such as those introduced in Missing Data Structures) assume that, after conditioning on the relevant data, the missingness of one variable is independent of the missingness of another. In SM, by contrast, the missingness of a variable can depend on both the data and the missingness of other variables [JMH+23].
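The distinction can be written down compactly. The notation here is an assumption introduced for illustration and is not taken from the cited papers: let $X$ denote the data, $M_j$ the missingness indicator for variable $j$, and $M_{-j}$ the indicators for all other variables. A rough sketch of the two cases is:

$$
\begin{aligned}
\text{Standard assumption:}\quad & P(M_j \mid M_{-j}, X) = P(M_j \mid X) \quad \text{for every variable } j,\\
\text{Structured missingness:}\quad & P(M_j \mid M_{-j}, X) \neq P(M_j \mid X) \quad \text{for at least one variable } j.
\end{aligned}
$$

In words, under SM, knowing which other variables are missing still tells us something about whether variable $j$ is missing, even once the data have been taken into account.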

This is common in research contexts where data is combined from multiple studies or sources. For instance, many large-scale healthcare studies are multimodal and aim to include a diverse set of patients, so they capture data for a heterogeneous group of individuals. Data is therefore often collected at multiple time points and multiple sites, where different measurements (such as clinical, genomic or imaging measures) may be taken. Our example dataset (introduced in Missing Data Structures) also exhibits SM.

Example: The missing values in the blood test results, blood pressure readings, and cognitive scores are all examples of SM. The blood test results (MCAR) are missing due to batch failure. The cognitive scores (MNAR) are missing in participants with significant cognitive decline. The blood pressure readings (MAR) are missing in participants who could not attend the clinic because they were older and had more motor dysfunction. Therefore, the missingness in these variables is not equally likely for all individuals, even after conditioning on the relevant data. The missingness carries information that can be leveraged in further analyses, which is why it is also considered SM.

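| Participant Number | Age | Diastolic Blood Pressure | Systolic Blood Pressure | Blood Test Result | Motor Score | Cognitive Score |
|---|---|---|---|---|---|---|
| 1 | 56 | 82 | 118 | N/A | 10 | 35 |
| 2 | 78 | 87 | 134 | N/A | 32 | 29 |
| 3 | 85 | N/A | N/A | N/A | 27 | N/A |
| 4 | 43 | 83 | 121 | Negative | 15 | 36 |
| 5 | 67 | 86 | 131 | Positive | 20 | 25 |
| 6 | 82 | 92 | 133 | Negative | 26 | N/A |
| 7 | 88 | N/A | N/A | Positive | 34 | N/A |
| 8 | 71 | N/A | N/A | Negative | 33 | 22 |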
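One way to look for structure in missingness is to tabulate which variables are missing together. The following is a minimal sketch using only the Python standard library. The column names and helper names are illustrative assumptions, the values are copied from the example table, and `None` stands in for N/A.

```python
from collections import Counter

# Column names are illustrative; values are copied from the example table.
COLUMNS = ["age", "diastolic_bp", "systolic_bp", "blood_test", "motor", "cognitive"]

# One tuple per participant (1 to 8); None stands for N/A.
ROWS = [
    (56, 82, 118, None, 10, 35),
    (78, 87, 134, None, 32, 29),
    (85, None, None, None, 27, None),
    (43, 83, 121, "Negative", 15, 36),
    (67, 86, 131, "Positive", 20, 25),
    (82, 92, 133, "Negative", 26, None),
    (88, None, None, "Positive", 34, None),
    (71, None, None, "Negative", 33, 22),
]

# Missingness mask: True where a value is missing.
masks = [tuple(value is None for value in row) for row in ROWS]

# Number of missing values per variable.
missing_counts = {
    col: sum(mask[i] for mask in masks) for i, col in enumerate(COLUMNS)
}

# Missingness patterns: the set of variables missing for each participant.
pattern_counts = Counter(
    tuple(col for col, is_missing in zip(COLUMNS, mask) if is_missing)
    for mask in masks
)

# Do the two blood pressure readings always go missing together?
d = COLUMNS.index("diastolic_bp")
s = COLUMNS.index("systolic_bp")
bp_missing_together = all(mask[d] == mask[s] for mask in masks)

if __name__ == "__main__":
    print(missing_counts)
    for pattern, count in pattern_counts.most_common():
        print(count, pattern if pattern else "(complete case)")
    print("BP readings missing together:", bp_missing_together)
```

In this toy dataset the diastolic and systolic readings are always missing for the same participants, so the missingness of one is fully informative about the missingness of the other. With only eight rows this is an illustration of the idea, not evidence about any mechanism.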

Not all forms of structured missingness have the same consequences for an analysis. For instance, a complete case analysis may be biased by the missing blood pressure readings (MAR) and cognitive scores (MNAR), whereas the missing blood test values (MCAR) would not introduce bias but would reduce statistical power. MAR data may introduce bias because the complete cases form a non-representative subset of the data. The same is true for MNAR data, but because the mechanism behind the missingness is not apparent, handling this bias in subsequent analysis can be challenging.
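The effect of a complete case analysis can be illustrated on the example dataset by comparing the mean age of all participants with the mean age of those who remain after rows with missing values are dropped. This is a minimal sketch using the Python standard library. The column names and the `mean_age` helper are assumptions made for illustration, and with eight participants the numbers only demonstrate the idea.

```python
from statistics import mean

# Column names are illustrative; values are copied from the example table.
COLUMNS = ["age", "diastolic_bp", "systolic_bp", "blood_test", "motor", "cognitive"]

# One tuple per participant (1 to 8); None stands for N/A.
ROWS = [
    (56, 82, 118, None, 10, 35),
    (78, 87, 134, None, 32, 29),
    (85, None, None, None, 27, None),
    (43, 83, 121, "Negative", 15, 36),
    (67, 86, 131, "Positive", 20, 25),
    (82, 92, 133, "Negative", 26, None),
    (88, None, None, "Positive", 34, None),
    (71, None, None, "Negative", 33, 22),
]

AGE = COLUMNS.index("age")


def mean_age(rows, required=()):
    """Return (mean age, number of rows) over rows where every column
    named in `required` is observed."""
    idx = [COLUMNS.index(col) for col in required]
    kept = [row for row in rows if all(row[i] is not None for i in idx)]
    return mean(row[AGE] for row in kept), len(kept)


all_rows = mean_age(ROWS)
blood_test_observed = mean_age(ROWS, required=["blood_test"])
bp_observed = mean_age(ROWS, required=["diastolic_bp", "systolic_bp"])
complete_cases = mean_age(ROWS, required=COLUMNS)

if __name__ == "__main__":
    print("All participants:        ", all_rows)
    print("Blood test observed:     ", blood_test_observed)
    print("Blood pressure observed: ", bp_observed)
    print("Complete cases only:     ", complete_cases)
```

Requiring the blood pressure readings removes three of the older participants (aged 85, 88 and 71), and only two participants (aged 43 and 67) are complete cases, so the retained subset is noticeably younger than the full sample. Dropping the rows with missing blood test results also costs three participants but shifts the mean age far less.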

Many datasets that fuse data from multiple sites and modalities do take care to follow a defined design and data collection process. However, machine learning methods perform best with large datasets, so it is common practice to train a model on data from many studies, often with different designs and variables. Missing values may therefore carry information in and of themselves; they may be related to sampling methodologies or reflect population characteristics. Traditional imputation methods, such as those introduced in the section on missing data methods, are frequently not appropriate for handling SM and do not take advantage of the information inherent in SM [MMC+23]. SM also has consequences for downstream analyses: if the SM mechanisms are biased, the fairness of the resulting model is called into question. Further research is required to identify appropriate methods for handling SM in general and to define SM within the MCAR, MAR, and MNAR framework [JMH+23].

Summary

SM is a new and expanding area of research that aims to improve large-scale statistical and machine learning analyses, by reducing model degradation and bias.