6.3.1. Impact of missing data
6.3.2. Missing data mechanisms
6.3.3. Methods for handling missing data
6.3.4. Statistical software
6.3.1. Impact of missing data
Missing data (or missing values) are data values that are not available for a variable in the data source of interest for a given analysis, and are therefore not observed. They may arise from attrition, non-response or poorly designed protocols. Missing data are a source of error because the available data do not represent the true value of what was set out to be measured.
Missing data are a common issue in both clinical trial and observational data, and can have significant consequences for the conclusions that can be drawn from an analysis, for the following reasons: 1) the absence of data reduces statistical power, i.e., the probability that a test will reject the null hypothesis when it is false; 2) the unobserved data can introduce bias and increase uncertainty in the estimation of the model parameters; 3) missing data can reduce the representativeness of the sample; 4) missing data may complicate the analysis, as the completeness of data may differ between variables. Each of these elements can lead to invalid conclusions. Whether these issues apply to the dataset under study depends on the type of missing data (i.e., the missing data mechanism).
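The loss of power mentioned in point 1 can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: it uses the textbook power formula for a two-sided two-sample z-test with equal group sizes, and the function name, effect size, standard deviation and sample sizes are hypothetical values chosen for the example, not taken from any study.

```python
from statistics import NormalDist


def power_two_sample_z(n_per_group, effect, sd, alpha=0.05):
    """Power of a two-sided two-sample z-test with equal group sizes."""
    z = NormalDist()
    se = sd * (2.0 / n_per_group) ** 0.5      # standard error of the mean difference
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)     # critical value of the test
    shift = effect / se                       # standardised true difference
    # Probability of rejecting in either tail when the true difference is `effect`
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical study: 100 subjects per group planned, 30% of records lost
planned = power_two_sample_z(100, effect=0.4, sd=1.0)
after_loss = power_two_sample_z(70, effect=0.4, sd=1.0)

print(f"power with 100 per group: {planned:.2f}")
print(f"power with  70 per group: {after_loss:.2f}")
```

The calculation only captures the effect of a smaller sample size; it does not reflect the bias that may arise when the lost records differ systematically from the retained ones.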
6.3.2. Missing data mechanisms
When missing data are present, choosing appropriate statistical methods and making inferences become more complex, because assumptions about the processes that generate the missing data need to be made explicit.
Missing data assumptions are classified into three categories, depending on the relationship between the unobserved values and the probability of missingness:
Missing completely at random (MCAR): there are no systematic differences between the distribution of the missing values and the observed values. Missingness is unrelated to any variable in the analysis, including the variable with missing data itself. This is the most restrictive assumption and is rarely realistic in practice.
Missing at random (MAR): any systematic difference between the missing and observed values for a given variable can be explained by differences in other variables of the observed data. Missingness is associated with those variables, but not with the variable with missing data itself. This mechanism may be more realistic in some real-world settings.
Missing not at random (MNAR): even after the observed data are taken into account, systematic differences remain between the missing values and the observed values. Missingness depends on the unobserved values of the variable with missing data itself.
Assumptions about the missing data mechanism determine which types of analysis are valid. In general, the three mechanisms cannot be distinguished on the basis of the observed data alone; in other words, missing data assumptions generally cannot be tested or verified. The observed data can provide evidence against MCAR in favour of MAR (for example, when missingness is associated with observed variables), but subject matter expertise and knowledge about the data collection process are needed to justify the assumption of data being MCAR or MAR. It is, however, not feasible to assess MAR versus MNAR based on the observed data.
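The three mechanisms described above can be made concrete with a small simulation. The following is a minimal sketch under assumed conditions, not a recommended analysis: the variables x (fully observed) and y (partially missing), the logistic missingness models and all coefficients are hypothetical. It compares the mean of y among the records that remain observed under each mechanism with the mean in the full data.

```python
import math
import random
from statistics import mean

random.seed(1)


def expit(v):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


n = 20000
x = [random.gauss(0, 1) for _ in range(n)]      # fully observed covariate
y = [xi + random.gauss(0, 1) for xi in x]       # variable subject to missingness


def observed_mean(prob_missing):
    """Mean of y among records kept; record i is dropped with probability prob_missing(i)."""
    kept = [y[i] for i in range(n) if random.random() >= prob_missing(i)]
    return mean(kept)


full = mean(y)
mcar = observed_mean(lambda i: 0.3)                        # unrelated to x and y
mar = observed_mean(lambda i: expit(1.5 * x[i] - 1.0))     # depends on observed x only
mnar = observed_mean(lambda i: expit(1.5 * y[i] - 1.0))    # depends on y itself

print(f"full data: {full:.2f}")
print(f"MCAR:      {mcar:.2f}")
print(f"MAR:       {mar:.2f}")
print(f"MNAR:      {mnar:.2f}")
```

In this setup the observed mean stays close to the full-data mean only under MCAR. Under MAR the difference could be recovered using x, whereas under MNAR it cannot be recovered from the observed data without further assumptions.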
6.3.3. Methods for handling missing data
Some simple solutions exist, but they generally lead to misleading inferences if the underlying assumptions about the missingness mechanism are not valid, and they should be avoided. Examples include single imputation methods such as last observation carried forward in longitudinal analyses, or mean substitution. Complete case analysis (CCA), i.e., removing all records with missing data, is only valid in certain circumstances, e.g., if the missing data are MCAR. Even in these circumstances, CCA results in loss of power and increased uncertainty in the estimated parameters.
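One reason why mean substitution is misleading can be shown with simulated data. The sketch below assumes a hypothetical normally distributed variable with 40% of values missing completely at random; all names and numbers are illustrative. It shows that filling the gaps with the observed mean leaves the mean unchanged but artificially shrinks the variance, so that standard errors computed from the filled-in data are too small.

```python
import random
from statistics import mean, variance

random.seed(2)

n = 5000
y = [random.gauss(50, 10) for _ in range(n)]

# MCAR: each value is set to missing (None) with probability 0.4
observed = [v if random.random() >= 0.4 else None for v in y]

# Complete case analysis keeps only the observed values
complete = [v for v in observed if v is not None]

# Mean substitution replaces every missing value by the observed mean
fill = mean(complete)
mean_imputed = [fill if v is None else v for v in observed]

var_cc = variance(complete)
var_imp = variance(mean_imputed)

# Naive standard errors of the mean under each approach
se_cc = (var_cc / len(complete)) ** 0.5
se_imp = (var_imp / len(mean_imputed)) ** 0.5

print(f"variance, complete cases:    {var_cc:.1f}")
print(f"variance, mean substitution: {var_imp:.1f}")
print(f"naive SE, complete cases:    {se_cc:.3f}")
print(f"naive SE, mean substitution: {se_imp:.3f}")
```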
It is therefore advised to use other statistical methods to handle missing data, such as multiple imputation (Multiple Imputation and its Application, Wiley 2013, ISBN:9780470740521) or inverse probability weighting (Review of inverse probability weighting for dealing with missing data, Statistical Methods in Medical Research 2013;22:278-95). The choice of method depends on the assumed missing data mechanism.
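The idea behind inverse probability weighting can be sketched as follows. This is a simplified illustration under assumptions that are not part of the cited review: a single binary, fully observed variable g drives missingness (so the data are MAR given g), and the probability of being observed is estimated as the observed proportion within each level of g. All names and values are hypothetical.

```python
import random

random.seed(3)

n = 20000
records = []
for _ in range(n):
    g = 1 if random.random() < 0.5 else 0          # fully observed binary covariate
    y = random.gauss(20.0 if g else 10.0, 3.0)     # outcome differs between groups
    p_obs = 0.4 if g else 0.9                      # group 1 is observed less often
    seen = random.random() < p_obs
    records.append((g, y, seen))

true_mean = sum(y for _, y, _ in records) / n      # known only in a simulation

# Complete case mean: ignores that group 1 is under-represented
cc = [y for _, y, seen in records if seen]
cc_mean = sum(cc) / len(cc)

# Estimate the probability of being observed within each level of g
p_hat = {}
for level in (0, 1):
    flags = [seen for g, _, seen in records if g == level]
    p_hat[level] = sum(flags) / len(flags)

# Weight each complete case by the inverse of its estimated probability of being observed
num = sum(y / p_hat[g] for g, y, seen in records if seen)
den = sum(1.0 / p_hat[g] for g, y, seen in records if seen)
ipw_mean = num / den

print(f"true mean:          {true_mean:.2f}")
print(f"complete case mean: {cc_mean:.2f}")
print(f"IPW mean:           {ipw_mean:.2f}")
```

In practice the probability of being observed would usually be estimated from a regression model with several predictors, and the validity of the weighted estimate depends on that model being correctly specified.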
If the missing data can be assumed to be MCAR or MAR, multiple imputation (MI) using the Fully Conditional Specification (FCS), described in Flexible Imputation of Missing Data (Van Buuren S. 2nd ed. Chapman and Hall/CRC 2018, 10.1201/9780429492259), is a commonly used approach. MI uses the observed data to predict the missing values, generates multiple completed data sets, analyses each of them separately, and then pools the results (point estimates are averaged and their variances combined), so that the uncertainty due to the missing values is reflected in the final estimates.
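The impute, analyse and pool steps of MI can be sketched with simulated data. The example below is a simplified illustration and not the FCS algorithm itself: only one variable (y) is incomplete, it is imputed from a linear regression on a fully observed variable x, and a bootstrap of the complete cases is used as an assumed, approximate way of reflecting uncertainty in the imputation model. Results are combined with Rubin's rules. All names, coefficients and the number of imputations are hypothetical.

```python
import math
import random
from statistics import mean, variance

random.seed(4)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(xs, ys))
    return a, b, math.sqrt(rss / (len(xs) - 2))


def pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance."""
    m = len(estimates)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance
    return mean(estimates), within + (1 + 1 / m) * between


# Simulated data: the true mean of y is 2.0
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y_full = [2.0 + xi + random.gauss(0, 1) for xi in x]

# MAR: y is more likely to be missing (None) when the observed x is large
y = [None if random.random() < 1 / (1 + math.exp(-(1.5 * xi - 1.0))) else yi
     for xi, yi in zip(x, y_full)]

cases = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
cc_mean = mean(yi for _, yi in cases)

M = 20                                # number of imputed data sets
estimates, variances = [], []
for _ in range(M):
    # Refit the imputation model on a bootstrap sample of the complete cases
    boot = random.choices(cases, k=len(cases))
    a, b, s = fit_line([c[0] for c in boot], [c[1] for c in boot])
    # Impute each missing y from the fitted model plus random residual noise
    completed = [yi if yi is not None else a + b * xi + random.gauss(0, s)
                 for xi, yi in zip(x, y)]
    # Analyse the completed data set: estimate the mean and its variance
    estimates.append(mean(completed))
    variances.append(variance(completed) / n)

mi_mean, mi_var = pool(estimates, variances)

print(f"complete case mean: {cc_mean:.2f}")
print(f"MI pooled mean:     {mi_mean:.2f} (SE {math.sqrt(mi_var):.3f})")
```

With several incomplete variables, FCS cycles through one such conditional model per variable; established software implementations should be preferred over hand-written code for real analyses.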
If the missing data are assumed to be MNAR, most common statistical analysis methods are not appropriate, and would lead to biased results. There are methods to handle MNAR data, which depend on different assumptions or incorporate more specific knowledge about the missingness mechanism. One example is the not-at-random fully conditional specification (NARFCS) as described in On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice (Stat Med. 2018, 37(15): 2338–53, 10.1002/sim.7643).
Multiple imputation (MI) methods such as Pattern Mixture Models can be used to implement any missing data assumption (Multiple Imputation and its Application, Wiley 2013, ISBN:9780470740521).
As explained in The proportion of missing data should not be used to guide decisions on multiple imputation (J Clin Epidemiol. 2019;110:63-73), the amount of missing data should not determine the choice of MI method. In general, it is desirable to understand how sensitive the conclusions drawn from the data are to the missing data assumptions and to the particular method used to handle missing values. To investigate this, it is helpful to perform sensitivity analyses exploring how inferences vary under different mechanism assumptions and under different approaches.
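One generic way to run such a sensitivity analysis is a delta adjustment: values are first imputed as if the data were MAR and then shifted by a fixed amount (delta) that represents a hypothesised departure towards MNAR. The sketch below is a deliberately simplified illustration and is not the NARFCS procedure: there are no covariates, the imputation model is a normal distribution fitted to the observed values, parameter uncertainty is ignored, and the delta values and sample sizes are hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(5)

# Hypothetical data: 700 observed values and 300 missing values
observed = [random.gauss(100, 15) for _ in range(700)]
n_missing = 300


def estimate_mean(delta, m=20):
    """Overall mean averaged over m imputed data sets, with imputed values shifted by delta."""
    mu, sd = mean(observed), stdev(observed)
    results = []
    for _ in range(m):
        # Impute as under MAR, then shift by delta to represent an MNAR departure
        imputed = [random.gauss(mu, sd) + delta for _ in range(n_missing)]
        results.append(mean(observed + imputed))
    return mean(results)


# delta = 0 corresponds to the MAR analysis; other values explore MNAR scenarios
sensitivity = {delta: estimate_mean(delta) for delta in (-10, -5, 0, 5, 10)}

for delta, est in sensitivity.items():
    print(f"delta = {delta:+d}: estimated mean = {est:.1f}")
```

The range of delta values should be justified by subject matter knowledge; the conclusions can be considered robust if they do not change materially across the plausible range.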
A practice sometimes used is to create a category of the variable, or an indicator, for the missing values; however, this should be avoided. This practice can be invalid even if the data are missing completely at random; see Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression (J Am Stat Assoc. 1996;91(433):222-30) and Missing data in epidemiological studies (In Armitage P, Colton T, eds. Encyclopedia of biostatistics. Wiley, 1998: 2641-2654).
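The problem with the missing indicator approach can be illustrated with a simulation. The sketch below makes assumptions that are not taken from the cited articles: a linear model in which the outcome y depends on an exposure x and a correlated confounder z, each with a true coefficient of 1, and z missing completely at random for half of the records. The small least-squares helper and all names and values are hypothetical.

```python
import math
import random

random.seed(6)


def ols(rows, ys):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yv for r, yv in zip(rows, ys))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


n = 10000
data = []
for _ in range(n):
    x = random.gauss(0, 1)                                    # exposure
    z = 0.7 * x + math.sqrt(0.51) * random.gauss(0, 1)        # confounder correlated with x
    y = x + z + random.gauss(0, 1)                            # true coefficient of x is 1
    miss = random.random() < 0.5                              # z is MCAR for half the records
    data.append((x, z, y, miss))

# Complete case analysis: regress y on x and z among records with z observed
cc_rows = [[1.0, x, z] for x, z, y, miss in data if not miss]
cc_y = [y for x, z, y, miss in data if not miss]
b_cc = ols(cc_rows, cc_y)[1]

# Missing indicator method: set missing z to 0 and add an indicator for missingness
ind_rows = [[1.0, x, 0.0 if miss else z, 1.0 if miss else 0.0]
            for x, z, y, miss in data]
ind_y = [y for x, z, y, miss in data]
b_ind = ols(ind_rows, ind_y)[1]

print(f"coefficient of x, complete cases:    {b_cc:.2f}")
print(f"coefficient of x, missing indicator: {b_ind:.2f}")
```

In this setup the complete case estimate is close to the true value, while the missing indicator estimate is biased because the records with missing z contribute an association between x and y that is not adjusted for the confounder.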
A concise review of methods to handle missing data is provided in the book Statistical analysis with missing data (Little RJA, Rubin DB. 3rd ed., Wiley 2019). The section ‘Handling of missing values’ in Modern Epidemiology, 4th ed. (T. Lash, T. VanderWeele, S. Haneuse, K. Rothman. Wolters Kluwer, 2020) is a summary of the state of the art, focused on practical issues for epidemiologists.
Other useful references on handling missing data include the books Multiple Imputation for Nonresponse in Surveys (Rubin DB, Wiley, 2004) and Analysis of Incomplete Multivariate Data (Schafer JL, Chapman & Hall/CRC, 1997), and the articles A comparison of multiple imputation methods for missing data in longitudinal studies (BMC Med Res Methodol. 2018;18(1):168), Using the outcome for imputation of missing predictor values was preferred (J Clin Epi. 2006;59(10):1092-101), and Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data (Stat Med. 2014;33(21):3725-37).
The article Framework for the treatment and reporting of missing data in observational studies: The Treatment and Reporting of Missing data in Observational Studies framework (J Clin Epi. 2021;134:79-88) focuses on missing data in non-interventional studies and provides a framework on both analysis and reporting of study results relying on incomplete data.
6.3.4. Statistical software
Many statistical procedures in standard software automatically exclude subjects with missing data. However, a wide range of statistical software is currently available to impute missing data, mainly focusing on multiple imputation (MI) methods when missing data are assumed to be MAR, such as The MI Procedure of the SAS Institute, Multiple imputation of missing values (Stata J. 2004;4:227-41), and mice: Multivariate Imputation by Chained Equations in R (J Stat Soft. 2011;45(3)). A good overview of available software packages is provided in Missing data: A statistical framework for practice (Biom J. 2021;63(5):915-47). Software tools in SAS and R for multiple imputation of missing data under MAR and MNAR have also been made available by the Drug Information Association Scientific Working Group on Estimands and Missing Data.