ENCePP Guide on Methodological Standards in Pharmacoepidemiology


5.3. Missing data


5.3.1. Impact of missing data


Missing data (or missing values) are defined as data value(s) that are not available for a variable in the data source of interest. Missing data are a common problem in all datasets and can have significant consequences on the conclusions that can be drawn from the data, for the following reasons: 1) the absence of data reduces statistical power, i.e., the probability that a test will reject the null hypothesis when it is false; 2) the unobserved data can cause bias in the estimation of parameters; 3) missing data can reduce the representativeness of the sample; 4) missing data may complicate the analyses, because the completeness of data may differ between variables. Each of these elements can lead to invalid conclusions.


5.3.2. Missing data mechanisms


When dealing with missing data, performing analyses and making inferences are more complex, as assumptions about the processes that create missing data need to be made explicitly.

Missing data are classified into three categories, depending on the relationship between observed and missing data:

  • Missing completely at random (MCAR): there are no systematic differences between the missing values and the observed values. Missingness is unrelated to any variable in the analysis, including the variable with missing data itself. This is the most restrictive assumption and is unlikely to hold in practice.
  • Missing at random (MAR): any systematic difference between the missing values and the observed values can be explained by differences in the observed data. Missingness is associated with variables in the analysis, but not with the variable with missing data itself. This mechanism is more likely in many real-world settings.
  • Missing not at random (MNAR): even after the observed data are taken into account, systematic differences remain between the missing values and the observed values. Missingness depends on the unobserved values and is associated with the variable with missing data itself.
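
The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the variables (`age`, `sbp`), the effect sizes and the missingness probabilities are hypothetical and do not come from this Guide. It compares the mean of the values that remain observed under each mechanism with the mean of the full data.

```python
import random
import statistics

rng = random.Random(0)
n = 20000

# Hypothetical cohort: systolic blood pressure (sbp) increases with age.
age = [rng.gauss(50, 10) for _ in range(n)]
sbp = [100 + 0.6 * a + rng.gauss(0, 10) for a in age]

def observed_mean(values, missing_prob):
    """Mean of the values left after deleting each one with its own probability."""
    kept = [v for v, p in zip(values, missing_prob) if rng.random() >= p]
    return statistics.fmean(kept)

# MCAR: the same probability of missingness for everyone.
p_mcar = [0.3] * n
# MAR: missingness of sbp depends only on age, which is fully observed.
p_mar = [0.6 if a > 50 else 0.1 for a in age]
# MNAR: missingness of sbp depends on the (unobserved) sbp value itself.
p_mnar = [0.6 if s > 130 else 0.1 for s in sbp]

true_mean = statistics.fmean(sbp)
mcar_mean = observed_mean(sbp, p_mcar)
mar_mean = observed_mean(sbp, p_mar)
mnar_mean = observed_mean(sbp, p_mnar)

print(f"full data {true_mean:.1f}  MCAR {mcar_mean:.1f}  "
      f"MAR {mar_mean:.1f}  MNAR {mnar_mean:.1f}")
```

In this set-up the observed mean is close to the full-data mean only under MCAR. Under MAR the difference can in principle be explained by the observed variable (here, age), whereas under MNAR it cannot be explained by the observed data.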

The missing data mechanism determines which types of analysis are valid. In practice, the three mechanisms cannot be fully distinguished based on the observed data alone. Simple tests on the observed data can provide evidence against MCAR in favour of MAR, but MAR cannot be distinguished from MNAR in this way, so subject matter expertise and knowledge about the data collection process are needed to justify the assumption of data being MCAR or MAR. It is also more difficult to identify appropriate models under MNAR.
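
One simple check of this kind is to compare a fully observed covariate between records with and without a missing value. The sketch below is a minimal example on simulated data; the variable `age`, the missingness probabilities and the helper names `welch_t` and `t_for` are assumptions made for illustration only.

```python
import random
import statistics

def welch_t(a, b):
    """Welch t statistic for the difference in means between two samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se

rng = random.Random(7)
n = 5000
age = [rng.gauss(50, 10) for _ in range(n)]  # fully observed covariate

# Flags marking whether another (hypothetical) variable is missing.
missing_mcar = [rng.random() < 0.3 for _ in range(n)]
missing_mar = [rng.random() < (0.5 if a > 50 else 0.1) for a in age]

def t_for(missing_flags):
    """Compare age between records with and without the missing value."""
    with_missing = [a for a, m in zip(age, missing_flags) if m]
    without_missing = [a for a, m in zip(age, missing_flags) if not m]
    return welch_t(with_missing, without_missing)

t_mcar = t_for(missing_mcar)
t_mar = t_for(missing_mar)
print(f"t statistic under MCAR: {t_mcar:.2f}, under MAR: {t_mar:.2f}")
```

A large statistic is evidence against MCAR. A small one does not prove MCAR, and neither result says anything about whether the data are MNAR.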


5.3.3. Handling missing data


Conventional statistical methods assume that all variables in a specified model are measured for all subjects. As this is not always the case, several statistical procedures have been developed to account for missingness, with the aim of generating the evidence about the target population that would have been obtained had the data been complete.


Some computationally simple solutions exist, but they generally lead to misleading inferences if their underlying assumptions are not valid and should be avoided. Examples include carrying forward the last observation in longitudinal analysis, mean substitution, redefining the parameters or the population, or analysing only complete data. Complete case analysis (CCA), i.e., removing all records with missing data, is only valid in certain circumstances, e.g., if the missing data are MCAR. Even in these circumstances, CCA will result in loss of power. Therefore, it is advised to use statistical methods to impute missing data.
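
As an illustration of why a simple solution such as mean substitution can mislead, the following sketch uses simulated data in which values are MCAR. The distribution and the proportion of missing values are assumed for the example. It compares the standard deviation of the complete cases with that of the mean-substituted data.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical measurements; about 40% are deleted completely at random.
values = [rng.gauss(10, 2) for _ in range(5000)]
observed = [v if rng.random() > 0.4 else None for v in values]

# Complete case analysis: keep only the records with a value.
complete_cases = [v for v in observed if v is not None]

# Mean substitution: replace every missing value by the observed mean.
fill = statistics.fmean(complete_cases)
mean_imputed = [fill if v is None else v for v in observed]

sd_full = statistics.stdev(values)
sd_cc = statistics.stdev(complete_cases)
sd_imputed = statistics.stdev(mean_imputed)

print(f"SD full data {sd_full:.2f}, complete cases {sd_cc:.2f}, "
      f"mean-substituted {sd_imputed:.2f}")
```

Here the complete cases preserve the spread of the data (at the cost of a smaller sample), while mean substitution artificially shrinks it, so that standard errors computed from the filled-in data would be too small.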


The choice of such statistical methods will depend on the missing data mechanism. In general, it is desirable to show that conclusions drawn from the data are not sensitive to the particular method used to handle missing values. To investigate this, it may be helpful to repeat the analysis with a variety of statistical approaches and sensitivity analyses to explore how inferences vary under various mechanism assumptions and under various models.


Most multiple imputation (MI) methods require the data to be MCAR or MAR. If this is the case, the Fully Conditional Specification (FCS), described in Flexible Imputation of Missing Data (Van Buuren S. 2nd ed. Chapman and Hall/CRC 2018, 10.1201/9780429492259), is a commonly used approach. MI uses the observed data to predict the values of the missing data points, generates multiple completed data sets, performs the analysis on each of them, and then pools the results across the data sets.
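
The impute, analyse and pool steps can be sketched as follows. This is a deliberately simplified illustration and not the FCS algorithm: it imputes a single incomplete variable `y` from one fully observed variable `x` using a regression refitted on a bootstrap sample for each imputation, and pools the estimated mean of `y` with Rubin's rules. The data, the MAR mechanism and the helper names (`fit_line`, `impute_once`, `pool_rubin`) are assumptions made for the example.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual SD)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (sse / (len(xs) - 2)) ** 0.5

def impute_once(xs, ys, rng):
    """Fill missing y with draws from a regression fitted on a bootstrap sample."""
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    boot = [rng.choice(obs) for _ in obs]  # reflects uncertainty in the model
    a, b, sd = fit_line([x for x, _ in boot], [y for _, y in boot])
    return [y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)]

def pool_rubin(estimates, variances):
    """Rubin's rules: pooled estimate and total variance."""
    m = len(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return statistics.fmean(estimates), within + (1 + 1 / m) * between

rng = random.Random(42)
n = 2000
xs = [rng.gauss(0, 1) for _ in range(n)]
y_full = [2.0 + 1.5 * x + rng.gauss(0, 1) for x in xs]  # true mean of y is 2.0
# MAR: y is more often missing when the observed x is large.
ys = [None if rng.random() < (0.7 if x > 0 else 0.1) else y
      for x, y in zip(xs, y_full)]

cc_mean = statistics.fmean(y for y in ys if y is not None)

estimates, variances = [], []
for _ in range(20):  # 20 imputed data sets
    completed = impute_once(xs, ys, rng)
    estimates.append(statistics.fmean(completed))
    variances.append(statistics.variance(completed) / n)

mi_mean, mi_var = pool_rubin(estimates, variances)
print(f"complete cases {cc_mean:.2f}, MI {mi_mean:.2f} (SE {mi_var ** 0.5:.3f})")
```

In this simulated MAR setting the complete-case mean is biased, while the pooled MI estimate is close to the true value. In practice, established software implementations should be preferred to hand-written code.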


If the data are MNAR, most common methods are not appropriate, and would lead to biased results. There are methods to handle MNAR data, which depend on different assumptions or incorporate more specific knowledge about the missingness mechanism. One example is the not-at-random fully conditional specification (NARFCS) as described in On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice (Stat Med. 2018, 37(15): 2338–53, 10.1002/sim.7643).


A commonly used method that should, however, be avoided is to create a category of the variable, or an indicator, for the missing values. This practice can be invalid even if the data are missing completely at random, see Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression (J Am Stat Assoc. 1996;91(433):222-30).
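
The problem with the missing-indicator method can be reproduced in a small simulation. In the sketch below, all values are assumptions chosen for illustration: two correlated covariates `x1` and `x2` both have a true coefficient of 1, and `x2` is MCAR in about half of the records. The helper `ols` is a hypothetical, minimal least-squares solver written for this example.

```python
import random

def ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(3)
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.7 * a + (0.51 ** 0.5) * rng.gauss(0, 1) for a in x1]  # corr(x1, x2) = 0.7
y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]  # true coefficients: 1 and 1
missing = [rng.random() < 0.5 for _ in range(n)]  # x2 is MCAR

# Complete case analysis: drop the records where x2 is missing.
cc_rows = [[1.0, a, b] for a, b, m in zip(x1, x2, missing) if not m]
cc_y = [v for v, m in zip(y, missing) if not m]
cc_b1 = ols(cc_rows, cc_y)[1]

# Missing-indicator method: set missing x2 to 0 and add an indicator column.
ind_rows = [[1.0, a, 0.0 if m else b, 1.0 if m else 0.0]
            for a, b, m in zip(x1, x2, missing)]
indicator_b1 = ols(ind_rows, y)[1]

print(f"coefficient of x1: complete cases {cc_b1:.2f}, "
      f"missing indicator {indicator_b1:.2f} (true value 1.00)")
```

In this set-up the complete case estimate of the coefficient of `x1` is close to its true value, whereas the missing-indicator estimate is clearly biased upwards, because `x1` absorbs part of the effect of `x2` in the records where `x2` is missing.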


A concise review of methods to handle missing data is provided in the section ‘Missing data’ of the Encyclopedia of Epidemiologic Methods (Gail MH, Benichou J, Editors. Wiley 2000) and in the book Statistical analysis with missing data (Little RJA, Rubin DB. 3rd ed., Wiley 2019). The section ‘Handling of missing values’ in Modern Epidemiology (T. Lash, T. VanderWeele, S. Haneuse, K. Rothman. Wolters Kluwer, 2020) is a summary of the state of the art, focused on practical issues for epidemiologists.


Other useful references on handling missing data include the books Multiple Imputation for Nonresponse in Surveys (Rubin DB, Wiley, 2004) and Analysis of Incomplete Multivariate Data (Schafer JL, Chapman & Hall/CRC, 1997), and the articles Using the outcome for imputation of missing predictor values was preferred (J Clin Epi. 2006;59(10):1092-101), Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data (Stat Med. 2014;33(21):3725-37), and Framework for the treatment and reporting of missing data in observational studies: The Treatment and Reporting of Missing data in Observational Studies framework (J Clin Epi. 2021;134:79-88).


5.3.4. Statistical software


Many statistical procedures in standard software automatically eliminate subjects with missing data. However, a wide range of statistical software is currently available to impute missing data, mainly focusing on Multiple Imputation (MI) methods when missing data are assumed to be MAR, such as The MI Procedure of the SAS Institute, Multiple imputation of missing values (Stata J. 2004;4:227-41) and mice: Multivariate Imputation by Chained Equations in R (J Stat Soft. 2011;45(3)). A good overview of available software packages is provided in Missing data: A statistical framework for practice (Biom J. 2021;63(5): 915-47).
