Missing data (or missing values) are defined as data values that are not available for a variable in the data source of interest. Missing data are a common problem in all datasets and can have significant consequences for the conclusions that can be drawn from the data, for the following reasons: 1) the absence of data reduces statistical power, i.e., the probability that a test will reject the null hypothesis when it is false; 2) the unobserved data can cause bias in the estimation of parameters; 3) missing data can reduce the representativeness of the sample; 4) missing data may complicate the analyses, as the completeness of data may differ between variables. Each of these elements can lead to invalid conclusions.

5.3.2. Missing data mechanisms

When data are missing, performing analyses and making inferences is more complex, as assumptions about the processes that create the missing data need to be made explicit.

Missing data are classified into three categories, depending on the relationship between observed and missing data:

- Missing completely at random (MCAR): there are no systematic differences between the missing values and the observed values. Missingness is unrelated to any variable in the analysis, including the variable with missing data itself. This is the most restrictive assumption and is unlikely to hold in practice.
- Missing at random (MAR): any systematic difference between the missing values and the observed values can be explained by differences in the observed data. Missingness is associated with variables in the analysis, but not with the variable with missing data itself. This mechanism is more plausible than MCAR in many real-world settings.
- Missing not at random (MNAR): even after the observed data are taken into account, systematic differences remain between the missing values and the observed values. Missingness depends on the unobserved values and is associated with the variable with missing data itself.
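
The practical difference between the three mechanisms can be illustrated with a small simulation. The sketch below is illustrative only: the variables `x` (always observed) and `y` (subject to missingness), the effect sizes and the logistic missingness models are all assumptions made for this example. It compares the naive mean of the observed `y` values with a mean adjusted for `x` through a regression line fitted to the observed records.

```python
import math
import random
import statistics

random.seed(0)
n = 20000

# Hypothetical data: x is always observed, y is subject to missingness.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]
true_mean = statistics.fmean(y)


def logistic(t):
    return 1 / (1 + math.exp(-t))


def fit_line(xs, ys):
    """Intercept and slope of a simple least-squares regression of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def estimates(prob_missing):
    """Naive and x-adjusted estimates of the mean of y after deleting values."""
    observed = [random.random() >= p for p in prob_missing]
    x_obs = [xi for xi, o in zip(x, observed) if o]
    y_obs = [yi for yi, o in zip(y, observed) if o]
    intercept, slope = fit_line(x_obs, y_obs)
    # Predict y for every subject from x, then average the predictions.
    adjusted = statistics.fmean([intercept + slope * xi for xi in x])
    return statistics.fmean(y_obs), adjusted


# Probability that y is missing for each subject under each mechanism.
mechanisms = {
    "MCAR": [0.5] * n,                       # unrelated to x or y
    "MAR": [logistic(2 * xi) for xi in x],   # depends on the observed x only
    "MNAR": [logistic(2 * yi) for yi in y],  # depends on y itself
}

results = {name: estimates(p) for name, p in mechanisms.items()}
print(f"true mean of y: {true_mean:.3f}")
for name, (naive, adjusted) in results.items():
    print(f"{name}: naive {naive:.3f}, adjusted for x {adjusted:.3f}")
```

In this setup the naive mean would be expected to be close to the true mean only under MCAR. Adjusting for `x` should recover it under MAR, but not under MNAR, where the missingness depends on the unobserved values themselves.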

The missing data mechanism, i.e., the distribution of the missingness, determines which types of analysis are valid. In practice, it is not possible to fully distinguish between these three mechanisms based on the observed data alone. Observed data can provide evidence against MCAR (for example, when missingness is associated with observed variables), but they cannot rule out MNAR, so subject matter expertise and knowledge about the data collection process are needed to justify the assumption that data are MCAR or MAR. Whereas simple tests can provide evidence against MCAR, it is more difficult to identify appropriate models under MNAR.
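
One simple check of this kind compares a fully observed covariate between records with and without a missing value. The sketch below is a hypothetical illustration: the covariate `age`, the missingness models and the sample size are assumptions, and the Welch t statistic is only one of several possible checks. A large statistic suggests that missingness depends on observed data, which makes MCAR implausible. A small statistic is compatible with MCAR but says nothing about whether the data are MNAR.

```python
import math
import random
import statistics

random.seed(1)
n = 5000

# Hypothetical fully observed covariate.
age = [random.gauss(50, 10) for _ in range(n)]

# Two hypothetical missingness patterns for some other variable:
# one that depends on age, and one that is completely random.
missing_by_age = [random.random() < 1 / (1 + math.exp(-(a - 50) / 5)) for a in age]
missing_at_random = [random.random() < 0.5 for _ in range(n)]


def welch_t(group_a, group_b):
    """Welch t statistic for the difference in means between two groups."""
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    return diff / math.sqrt(var_a / len(group_a) + var_b / len(group_b))


def compare_age(missing):
    """Compare age between records with and without a missing value."""
    with_missing = [a for a, m in zip(age, missing) if m]
    without_missing = [a for a, m in zip(age, missing) if not m]
    return welch_t(with_missing, without_missing)


t_mar = compare_age(missing_by_age)
t_mcar = compare_age(missing_at_random)
print(f"missingness depends on age: t = {t_mar:.1f}")
print(f"missingness is random:      t = {t_mcar:.1f}")
```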

Conventional statistical methods assume that all variables in a specified model are measured for all subjects. As this is not always the case, several statistical procedures have been developed to account for missingness, with the aim of generating meaningful evidence about the target population, as would have been obtained had the data been complete.

Some computationally simple solutions exist, but they generally lead to misleading inferences if their underlying assumptions are not valid, and should be avoided. Examples include carrying forward the last observation in longitudinal analyses, mean substitution, redefining the parameters or the population, or analysing only complete data. Complete case analysis (CCA), i.e., removing all records with missing data, is only valid in certain circumstances, e.g., if the missing data are MCAR. Even in these circumstances, CCA will result in loss of power. Therefore, it is advised to use statistical methods to impute missing data.
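
The following sketch illustrates two of these points on simulated data, assuming a hypothetical continuous variable with about 40% of values missing completely at random (all numbers are arbitrary choices for the example). CCA discards records, and mean substitution keeps the sample size but artificially shrinks the variability of the variable, which in turn distorts standard errors and associations with other variables.

```python
import random
import statistics

random.seed(3)
n = 10000

# Hypothetical measurement, with about 40% of values missing completely at random.
full = [random.gauss(10, 2) for _ in range(n)]
values = [None if random.random() < 0.4 else v for v in full]

# Complete case analysis: keep only the observed values.
complete = [v for v in values if v is not None]

# Mean substitution: replace every missing value with the observed mean.
fill = statistics.fmean(complete)
mean_substituted = [fill if v is None else v for v in values]

sd_full = statistics.stdev(full)
sd_complete = statistics.stdev(complete)
sd_substituted = statistics.stdev(mean_substituted)

print(f"records analysed by CCA: {len(complete)} of {n}")
print(f"SD, full data:           {sd_full:.2f}")
print(f"SD, complete cases:      {sd_complete:.2f}")
print(f"SD, mean substitution:   {sd_substituted:.2f}")
```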

The choice of such statistical methods will depend on the missing data mechanism. In general, it is desirable to show that conclusions drawn from the data are not sensitive to the particular method used to handle missing values. To investigate this, it may be helpful to repeat the analysis with a variety of statistical approaches and sensitivity analyses to explore how inferences vary under various mechanism assumptions and under various models.

Most multiple imputation (MI) methods require the data to be MCAR or MAR. If this is the case, the Fully Conditional Specification (FCS), described in *Flexible Imputation of Missing Data* (Van Buuren S. 2nd ed. Chapman and Hall/CRC 2018, 10.1201/9780429492259), is a commonly used approach. MI uses the observed data to predict the values of missing data points, generating multiple completed data sets, performing the analysis on each imputed data set, and then pooling the results (averaging the estimates and combining their variances).
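
The impute, analyse and pool sequence can be sketched for a single incomplete variable as follows. This is a simplified illustration and not the FCS algorithm itself, which cycles through conditional imputation models for several incomplete variables. The data, the linear imputation model, the use of a bootstrap to reflect parameter uncertainty and the number of imputations `m` are all assumptions made for the example. The pooling step follows Rubin's rules.

```python
import math
import random
import statistics

random.seed(4)
n, m = 2000, 20  # sample size and number of imputations (both arbitrary)

x = [random.gauss(0, 1) for _ in range(n)]
y_full = [1.0 + 0.8 * xi + random.gauss(0, 0.6) for xi in x]
# MAR: the probability that y is missing depends only on the observed x.
y = [None if random.random() < 1 / (1 + math.exp(-2 * xi)) else yi
     for xi, yi in zip(x, y_full)]


def fit_line(pairs):
    """Intercept, slope and residual SD of a least-squares line through (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in pairs) / sxx
    intercept = my - slope * mx
    sse = sum((b - intercept - slope * a) ** 2 for a, b in pairs)
    return intercept, slope, math.sqrt(sse / (len(pairs) - 2))


complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
estimates, variances = [], []
for _ in range(m):
    # Refit on a bootstrap sample so that parameter uncertainty is propagated.
    a, b, s = fit_line(random.choices(complete, k=len(complete)))
    # Impute each missing y from the fitted line plus random residual noise.
    filled = [yi if yi is not None else a + b * xi + random.gauss(0, s)
              for xi, yi in zip(x, y)]
    # Analysis step: estimate the mean of y and its sampling variance.
    estimates.append(statistics.fmean(filled))
    variances.append(statistics.variance(filled) / n)

# Pooling step (Rubin's rules).
pooled_mean = statistics.fmean(estimates)
within = statistics.fmean(variances)      # average within-imputation variance
between = statistics.variance(estimates)  # between-imputation variance
total_variance = within + (1 + 1 / m) * between
pooled_se = math.sqrt(total_variance)

complete_case_mean = statistics.fmean([yi for _, yi in complete])
print(f"complete case mean: {complete_case_mean:.3f}")
print(f"pooled MI mean:     {pooled_mean:.3f} (SE {pooled_se:.3f})")
```

The simulated true mean of `y` is 1.0. Because missingness here depends on `x`, the complete case mean would be expected to be biased, while the pooled estimate should lie close to the true value, with a standard error that reflects the uncertainty added by the imputation.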

If the data are MNAR, most common methods are not appropriate and would lead to biased results. There are methods to handle MNAR data, which depend on different assumptions or incorporate more specific knowledge about the missingness mechanism. One example is the not-at-random fully conditional specification (NARFCS) as described in *On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice* (Stat Med. 2018, 37(15): 2338–53, 10.1002/sim.7643).
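
A simple way to explore departures from MAR is a delta-adjustment sensitivity analysis, in which values imputed under MAR are shifted by a user-specified amount. The sketch below does not implement NARFCS; it is a simplified, single-variable illustration of the related idea of a sensitivity parameter. The data, the linear imputation model and the grid of `delta` values are assumptions, and a single imputation per `delta` is used for brevity where multiple imputation would normally be applied.

```python
import math
import random
import statistics

rng = random.Random(5)
n = 2000

x = [rng.gauss(0, 1) for _ in range(n)]
y_full = [1.0 + 0.8 * xi + rng.gauss(0, 0.6) for xi in x]
# MNAR: larger values of y are more likely to be missing.
y = [None if rng.random() < 1 / (1 + math.exp(-2 * (yi - 1.0))) else yi
     for yi in y_full]
n_missing = sum(yi is None for yi in y)

# Least-squares line fitted to the complete cases (the MAR imputation model).
xs = [xi for xi, yi in zip(x, y) if yi is not None]
ys = [yi for yi in y if yi is not None]
mx, my = statistics.fmean(xs), statistics.fmean(ys)
slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
         / sum((a - mx) ** 2 for a in xs))
intercept = my - slope * mx
resid_sd = statistics.stdev([b - intercept - slope * a for a, b in zip(xs, ys)])

# delta is the assumed mean difference between the unobserved values and
# what the MAR model predicts for them; delta = 0 corresponds to MAR.
results = {}
for delta in (-0.5, 0.0, 0.5, 1.0):
    noise = random.Random(6)  # same residual draws for every delta
    filled = [yi if yi is not None
              else intercept + slope * xi + noise.gauss(0, resid_sd) + delta
              for xi, yi in zip(x, y)]
    results[delta] = statistics.fmean(filled)

for delta, estimate in results.items():
    print(f"delta = {delta:+.1f}: estimated mean of y = {estimate:.3f}")
```

Reporting the estimate over a plausible range of `delta` shows how strongly the conclusion depends on the untestable MAR assumption. The plausible range itself has to come from subject matter knowledge, since the observed data carry no information about it.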

A commonly used method that should, however, be avoided is to create a category of the variable, or an indicator, for the missing values. This practice can be invalid even if the data are missing completely at random, see *Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression* (J Am Stat Assoc. 1996;91(433):222-30).
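
The bias of the missing-indicator method can be seen in a small simulation. The sketch below is an illustration under assumed data: a confounder `z` that is missing completely at random for about half of the subjects, an exposure `x` and an outcome `y`, with a true coefficient of 1 for `x`. The helper `ols` is a minimal least-squares solver written for this example.

```python
import random

random.seed(7)
n = 20000

# Hypothetical data: z confounds the effect of x on y; the true coefficient of x is 1.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [xi + zi + random.gauss(0, 1) for xi, zi in zip(x, z)]
# z is missing completely at random for about half of the subjects.
z_missing = [random.random() < 0.5 for _ in range(n)]


def ols(rows, targets):
    """Least-squares coefficients from the normal equations (Gauss-Jordan elimination)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * w for v, w in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Complete case analysis: regress y on (1, x, z) among subjects with z observed.
cc_rows = [[1.0, xi, zi] for xi, zi, miss in zip(x, z, z_missing) if not miss]
cc_y = [yi for yi, miss in zip(y, z_missing) if not miss]
coef_cca = ols(cc_rows, cc_y)[1]

# Missing-indicator method: set missing z to 0 and add an indicator of missingness.
ind_rows = [[1.0, xi, 0.0 if miss else zi, float(miss)]
            for xi, zi, miss in zip(x, z, z_missing)]
coef_indicator = ols(ind_rows, y)[1]

print(f"coefficient of x, complete cases:    {coef_cca:.3f}")
print(f"coefficient of x, missing indicator: {coef_indicator:.3f}")
```

In this setup the complete case estimate should be close to 1, whereas the missing-indicator estimate would be expected to be noticeably larger, because subjects with missing `z` contribute an estimate that is not adjusted for the confounder.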

A concise review of methods to handle missing data is provided in the section ‘Missing data’ of the *Encyclopedia of Epidemiologic Methods* (Gail MH, Benichou J, Editors. Wiley 2000) and in the book *Statistical analysis with missing data* (Little RJA, Rubin DB. 3rd ed., Wiley 2019). The section ‘Handling of missing values’ in *Modern Epidemiology* (Lash T, VanderWeele T, Haneuse S, Rothman K. Wolters Kluwer, 2020) is a summary of the state of the art, focused on practical issues for epidemiologists.

Other useful references on handling missing data include the books *Multiple Imputation for Nonresponse in Surveys* (Rubin DB, Wiley, 2004) and *Analysis of Incomplete Multivariate Data* (Schafer JL, Chapman & Hall/CRC, 1997), and the articles *Using the outcome for imputation of missing predictor values was preferred* (J Clin Epi. 2006;59(10):1092-101), *Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data* (Stat Med. 2014;33(21):3725-37) and *Framework for the treatment and reporting of missing data in observational studies: The Treatment and Reporting of Missing data in Observational Studies framework* (J Clin Epi. 2021;134:79-88).

Many statistical procedures in standard software automatically eliminate subjects with missing data. However, a wide range of statistical software is currently available to impute missing data, mainly focusing on multiple imputation (MI) methods when missing data are assumed to be MAR, such as *The MI Procedure* of the SAS Institute, *Multiple imputation of missing values* (Stata J. 2004;4:227-41) and *mice: Multivariate Imputation by Chained Equations in R* (J Stat Soft. 2011;45(3)). A good overview of available software packages is provided in *Missing data: A statistical framework for practice* (Biom J. 2021;63(5): 915-47).