
This subchapter introduces the main missing data structures. To decide how best to handle missing data, it is important to understand our data and the context in which it was acquired. To that end, missing data can usually be classified into three main types based on why the data may be missing:

  1. Missing Completely at Random (MCAR)
  2. Missing at Random (MAR)
  3. Missing not at Random (MNAR)

These were originally proposed by Rubin (1976) and are explained in more detail below. As this chapter has been created as part of the Turing-Roche Community Scholar Scheme, the examples provided will be based in healthcare.

Example Dataset: We will be using a fictional study of health outcomes to explain the different mechanisms of missing data. For demonstration purposes, the dataset is first shown below as fully complete, and it has only 8 participants.

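| Participant Number | Age | Diastolic Blood Pressure | Systolic Blood Pressure | Blood Test Result | Motor Score | Cognitive Score |
|---|---|---|---|---|---|---|
| 1 | 56 | 82 | 118 | Positive | 10 | 35 |
| 2 | 78 | 87 | 134 | Negative | 32 | 29 |
| 3 | 85 | 90 | 130 | Negative | 27 | 14 |
| 4 | 43 | 83 | 121 | Negative | 15 | 36 |
| 5 | 67 | 86 | 131 | Positive | 20 | 25 |
| 6 | 82 | 92 | 133 | Negative | 26 | 12 |
| 7 | 88 | 95 | 140 | Positive | 34 | 10 |
| 8 | 71 | 87 | 126 | Negative | 33 | 22 |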

Generally, worse health outcomes are associated with:

  • a higher blood pressure measurement
  • a positive blood test result
  • a high motor score
  • a low cognitive score

In the examples below, any missing values are indicated by “N/A” (Not Available).
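To make the examples easier to experiment with, the complete dataset can be written down in code. The following is a minimal sketch in plain Python; the column names, the use of `None` for a missing value, and the helper `count_missing` are illustrative assumptions rather than part of the original study.

```python
# Column names are shortened versions of the table headers (an assumption).
COLUMNS = ["participant", "age", "diastolic_bp", "systolic_bp",
           "blood_test", "motor_score", "cognitive_score"]

# Values copied from the complete example table.
ROWS = [
    (1, 56, 82, 118, "Positive", 10, 35),
    (2, 78, 87, 134, "Negative", 32, 29),
    (3, 85, 90, 130, "Negative", 27, 14),
    (4, 43, 83, 121, "Negative", 15, 36),
    (5, 67, 86, 131, "Positive", 20, 25),
    (6, 82, 92, 133, "Negative", 26, 12),
    (7, 88, 95, 140, "Positive", 34, 10),
    (8, 71, 87, 126, "Negative", 33, 22),
]

# Turn each row into a dictionary keyed by column name.
dataset = [dict(zip(COLUMNS, row)) for row in ROWS]


def count_missing(records):
    """Return the number of missing (None) values in each column."""
    return {col: sum(r[col] is None for r in records) for col in COLUMNS}


# The complete dataset has no missing values in any column.
print(count_missing(dataset))
```

Here `None` plays the role of “N/A”, so the same helper can be reused once values start to go missing.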

Missing Completely at Random (MCAR)

Just as the name may suggest, missing data can be characterized as MCAR when it occurs completely randomly and is not due to an event caused by any variables of interest (whether observed or unobserved). Thus, there are no systematic differences between data entries with or without missing values, and no bias is introduced because of the missing data.

In reality, this is quite a strict classification and rarely occurs. Essentially, whatever causes the data to be missing in the first place has no effect on any of the variables in the study. This means that the probability of a data entry being missing is the same for every data point.

Example: A specific batch of blood samples was incorrectly processed, so the results were discarded. The missing data in the variable of interest (blood test result) is not explained by any observed or unobserved variables.

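| Participant Number | Age | Diastolic Blood Pressure | Systolic Blood Pressure | Blood Test Result | Motor Score | Cognitive Score |
|---|---|---|---|---|---|---|
| 1 | 56 | 82 | 118 | N/A | 10 | 35 |
| 2 | 78 | 87 | 134 | N/A | 32 | 29 |
| 3 | 85 | 90 | 130 | N/A | 27 | 14 |
| 4 | 43 | 83 | 121 | Negative | 15 | 36 |
| 5 | 67 | 86 | 131 | Positive | 20 | 25 |
| 6 | 82 | 92 | 133 | Negative | 26 | 12 |
| 7 | 88 | 95 | 140 | Positive | 34 | 10 |
| 8 | 71 | 87 | 126 | Negative | 33 | 22 |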

Here, the first batch of blood samples had to be discarded.
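The defining property of MCAR, that every entry has the same chance of being missing regardless of any values, can be sketched in a few lines. This is only an illustration: the function `apply_mcar`, the 40% probability, and the seed are hypothetical choices, and the example above loses a whole batch rather than flipping a coin per sample.

```python
import random

# Blood test results from the complete example table, keyed by participant number.
blood_tests = {1: "Positive", 2: "Negative", 3: "Negative", 4: "Negative",
               5: "Positive", 6: "Negative", 7: "Positive", 8: "Negative"}


def apply_mcar(values, prob_missing, seed):
    """Set each value to None with the same probability.

    The decision never looks at the value itself or at any other
    variable, which is what makes the missingness completely random.
    """
    rng = random.Random(seed)
    return {pid: (None if rng.random() < prob_missing else value)
            for pid, value in values.items()}


# prob_missing and seed are arbitrary choices for demonstration only.
with_missing = apply_mcar(blood_tests, prob_missing=0.4, seed=0)
print(with_missing)
```

Because the mask ignores the data, the participants who keep their results are, on average, no different from those who lose them.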

Missing at Random (MAR)

In contrast, when missingness is not completely random but can be explained by variables with complete data, this is known as MAR. For a given group defined by an observed variable, the probability of being missing is the same for all individuals in that group. Such missingness may or may not result in bias; if there is bias, it can be handled by accounting for the known variable correlated with the reason for missingness.

Example: Blood pressure readings may be missing from individuals who are older, frailer, and have less mobility, and are therefore less likely to attend the clinic. In this instance, the reason data is missing in the variable of interest (blood pressure) is related to other observed variables (age and mobility).

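| Participant Number | Age | Diastolic Blood Pressure | Systolic Blood Pressure | Blood Test Result | Motor Score | Cognitive Score |
|---|---|---|---|---|---|---|
| 1 | 56 | 82 | 118 | N/A | 10 | 35 |
| 2 | 78 | 87 | 134 | N/A | 32 | 29 |
| 3 | 85 | N/A | N/A | N/A | 27 | 14 |
| 4 | 43 | 83 | 121 | Negative | 15 | 36 |
| 5 | 67 | 86 | 131 | Positive | 20 | 25 |
| 6 | 82 | 92 | 133 | Negative | 26 | 12 |
| 7 | 88 | N/A | N/A | Positive | 34 | 10 |
| 8 | 71 | N/A | N/A | Negative | 33 | 22 |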

Here, older individuals (>70) with a high motor score (>26) were more likely to miss the blood pressure clinic; participants 3, 7, and 8 have missing readings.
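Under MAR, the probability of a value being missing depends only on variables that are fully observed. The sketch below illustrates this with the age and motor score thresholds from the example; the functions `prob_missing` and `apply_mar`, the 75% probability, and the seed are hypothetical and chosen purely for demonstration.

```python
import random

# (participant, age, motor score, diastolic BP, systolic BP) from the example table.
participants = [
    (1, 56, 10, 82, 118), (2, 78, 32, 87, 134), (3, 85, 27, 90, 130),
    (4, 43, 15, 83, 121), (5, 67, 20, 86, 131), (6, 82, 26, 92, 133),
    (7, 88, 34, 95, 140), (8, 71, 33, 87, 126),
]


def prob_missing(age, motor_score):
    """Hypothetical rule: older, less mobile participants miss the clinic more often.

    The probability depends only on observed variables, never on the
    blood pressure values themselves.
    """
    return 0.75 if age > 70 and motor_score > 26 else 0.0


def apply_mar(rows, seed):
    """Remove both blood pressure readings for participants who miss the clinic."""
    rng = random.Random(seed)
    result = []
    for pid, age, motor, diastolic, systolic in rows:
        if rng.random() < prob_missing(age, motor):
            diastolic, systolic = None, None  # both readings are lost together
        result.append((pid, age, motor, diastolic, systolic))
    return result


for row in apply_mar(participants, seed=1):
    print(row)
```

Within the group defined by age and motor score, every participant has the same chance of being missing, so conditioning on those two variables accounts for the mechanism.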

Missing not at Random (MNAR)

Data are MNAR when the reason for the missingness is related to information we have not observed, most often the value of the missing variable itself. This is the most complex case of data missingness to handle, as bias may occur but cannot be adjusted for because the source of the missingness is unmeasured.

Example: Follow-up cognitive testing may be missing for individuals who have had significant cognitive decline, as they are more likely to withdraw early from the study. Here, the reason for the missing data in the variable of interest (Cognitive Score) is correlated with unobserved data (the value of the observation itself).

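| Participant Number | Age | Diastolic Blood Pressure | Systolic Blood Pressure | Blood Test Result | Motor Score | Cognitive Score |
|---|---|---|---|---|---|---|
| 1 | 56 | 82 | 118 | N/A | 10 | 35 |
| 2 | 78 | 87 | 134 | N/A | 32 | 29 |
| 3 | 85 | N/A | N/A | N/A | 27 | N/A |
| 4 | 43 | 83 | 121 | Negative | 15 | 36 |
| 5 | 67 | 86 | 131 | Positive | 20 | 25 |
| 6 | 82 | 92 | 133 | Negative | 26 | N/A |
| 7 | 88 | N/A | N/A | Positive | 34 | N/A |
| 8 | 71 | N/A | N/A | Negative | 33 | 22 |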

Participants with a cognitive score less than 15 withdrew early from the study due to worsening outcomes.
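The bias that MNAR can introduce is easy to see with the cognitive scores from the example. The sketch below compares the mean of all eight scores with the mean of the scores that remain after withdrawal. The variable names are illustrative, and the comparison is only possible here because the dataset is fictional and the "true" values are known; in a real study they would not be.

```python
# Cognitive scores from the complete example table, keyed by participant number.
cognitive = {1: 35, 2: 29, 3: 14, 4: 36, 5: 25, 6: 12, 7: 10, 8: 22}

# From the example: participants scoring below 15 withdrew early.
WITHDRAWAL_THRESHOLD = 15

# Missingness depends on the value itself, which is what makes this MNAR.
observed = {pid: (score if score >= WITHDRAWAL_THRESHOLD else None)
            for pid, score in cognitive.items()}

true_mean = sum(cognitive.values()) / len(cognitive)

remaining = [score for score in observed.values() if score is not None]
observed_mean = sum(remaining) / len(remaining)

print(true_mean)      # 22.875
print(observed_mean)  # 29.4
```

The observed mean overstates cognitive health because the lowest scores are exactly the ones that are missing, and nothing in the remaining data reveals by how much.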

Summary

We have defined three types of missingness: MCAR, MAR, and MNAR. These definitions are particularly helpful in determining which data handling method to use. Simple implementations were used to demonstrate the types of missingness in a small dataset of health outcomes.

However, these are quite oversimplified and real-world datasets can be a lot more complex. For instance, a given participant may be unable to come into clinic, or unwilling to continue participating in a study, at a younger age, lower motor score, or higher cognitive score than average. Alternatively, they may have missing data in a variable that helps explain the reason or mechanism of missingness.

Several types of missingness may also be present in a given dataset, and sometimes multiple types may occur in one variable of interest. Therefore, handling missing data can be quite tricky. Here we directly observed the missing values by looking at the data. However, this is a cumbersome and unrealistic task in many real datasets, which may have thousands of participants and hundreds of variables (or more). Thus, visualisation methods that simplify determining any patterns of missingness are incredibly useful. These are explored in the next subchapter (Visualising Missingness).
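Even before any visualisation, a simple tabulation can replace inspecting the table by eye. The following sketch counts missing values per column and lists which columns go missing together in the final example table. It uses only the Python standard library; the column names and the use of `None` for “N/A” are assumptions made for illustration.

```python
from collections import Counter

COLUMNS = ["age", "diastolic_bp", "systolic_bp", "blood_test",
           "motor_score", "cognitive_score"]

# Final example table keyed by participant number; None stands in for "N/A".
final_table = {
    1: (56, 82, 118, None, 10, 35),
    2: (78, 87, 134, None, 32, 29),
    3: (85, None, None, None, 27, None),
    4: (43, 83, 121, "Negative", 15, 36),
    5: (67, 86, 131, "Positive", 20, 25),
    6: (82, 92, 133, "Negative", 26, None),
    7: (88, None, None, "Positive", 34, None),
    8: (71, None, None, "Negative", 33, 22),
}

# How many values are missing in each column?
missing_per_column = {
    col: sum(row[i] is None for row in final_table.values())
    for i, col in enumerate(COLUMNS)
}

# Which combinations of columns are missing together, and how often?
patterns = Counter(
    tuple(col for col, value in zip(COLUMNS, row) if value is None)
    for row in final_table.values()
)

print(missing_per_column)
for pattern, count in patterns.most_common():
    print(count, pattern or "complete")
```

A summary like this shows where values are missing, but not why; the mechanism (MCAR, MAR, or MNAR) still has to be judged from how the data was collected.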

References
  1. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. doi:10.1093/biomet/63.3.581