

ENCePP Guide on Methodological Standards in Pharmacoepidemiology


5.1. Definition and validation of drug exposure, outcomes and covariates


Historically, pharmacoepidemiology studies relied on patient-supplied information or searches through paper-based health records. The rapid increase in access to electronic healthcare records and large administrative databases has changed the way exposures and outcomes are defined, measured and validated. All variables should be defined with care, taking into account the fact that information is often recorded for purposes other than pharmacoepidemiology.


5.1.1. Assessment of exposure


In pharmacoepidemiology studies, exposure data originate mainly from four data sources: data on prescribing (e.g. CPRD primary care data), data on dispensing (e.g. PHARMO outpatient pharmacy database), data on payment for medication (namely claims data, e.g. IMS LifeLink PharMetrics Plus) and data collected in surveys. The population included in these data sources follows a process of attrition: drugs that are prescribed are not necessarily dispensed, and drugs that are dispensed are not necessarily ingested. In Primary non-adherence in general practice: a Danish register study (Eur J Clin Pharmacol 2014;70(6):757-63), 9.3% of all prescriptions for new therapies were never redeemed at the pharmacy, with percentages differing across therapeutic and patient groups. The attrition from dispensing to ingestion is even more difficult to measure, as it is compounded by uncertainties about which dispensed drugs are actually taken by patients and about patients’ ability to provide an accurate account of their intake. In addition, the assessment of paediatric adherence depends on parents’ accurate recollection and recording.


Exposure definitions can include simple dichotomous variables (e.g. ever exposed vs. never exposed) or be more detailed, including estimates of duration, exposure windows (e.g. current vs. past exposure) or dosage (e.g. current dosage, cumulative dosage over time). Consideration should be given to the level of detail available from the data sources on the timing of exposure, including the quantity prescribed, dispensed or ingested and the capture of dosage instructions. This will vary across data sources and exposures (e.g. estimating anticonvulsant ingestion is typically easier than estimating rescue medication for asthma attacks). Discussions with clinicians regarding sensible assumptions will be informative for the variable definition.
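
As an illustration of the exposure definitions described above, the following minimal sketch derives a current/past/never exposure status and a cumulative dose from dispensing records. The record layout, the function names and the 30-day grace period are assumptions made for this example only; it also assumes that every dispensed unit is ingested, which is exactly the attrition problem discussed in this section.

```python
from datetime import date, timedelta

# Each dispensing is a tuple: (dispensing_date, days_supply, daily_dose_mg).
# This layout is hypothetical; real data sources differ in what they capture.

def exposure_status(dispensings, index_date, grace_days=30):
    """Classify exposure on index_date as 'current', 'past' or 'never'.

    A patient is 'current' if index_date falls within the days supplied by
    any dispensing plus a grace period (an assumed 30 days by default).
    """
    status = "never"
    for disp_date, days_supply, _dose in dispensings:
        if disp_date > index_date:
            continue  # ignore dispensings after the index date
        window_end = disp_date + timedelta(days=days_supply + grace_days)
        if index_date <= window_end:
            return "current"
        status = "past"  # dispensed earlier, but the supply has run out
    return status

def cumulative_dose_mg(dispensings, index_date):
    """Sum the dispensed dose up to index_date, assuming full ingestion."""
    return sum(
        days_supply * dose
        for disp_date, days_supply, dose in dispensings
        if disp_date <= index_date
    )

records = [
    (date(2020, 1, 1), 30, 10.0),
    (date(2020, 3, 1), 30, 10.0),
]
print(exposure_status(records, date(2020, 3, 15)))
print(exposure_status(records, date(2020, 8, 1)))
print(cumulative_dose_mg(records, date(2020, 3, 15)))
```

The grace period and the handling of overlapping supplies are the kind of assumptions that should be agreed with clinicians and stated in the protocol.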

The Methodology chapter of the book Drug Utilization Research. Methods and Applications (M. Elseviers, B. Wettermark, A.B. Almarsdottir et al. Ed. Wiley Blackwell, 2016) discusses different methods for data collection on drug utilisation.

5.1.2. Assessment of outcomes


A case definition compatible with the data source should be developed for each outcome of a study at the design stage. This description should include how events will be identified and classified as cases, whether cases will include prevalent as well as incident cases, exacerbations and second episodes (as differentiated from repeat codes) and all other inclusion or exclusion criteria. The reason for the data collection and the nature of the healthcare system that generated the data should also be described as they can impact on the quality of the available information and the presence of potential biases. Published case definitions of outcomes, such as those developed by the Brighton Collaboration in the context of vaccinations, are useful but are not necessarily compatible with the information available in the observational data sources. For example, information on the duration of symptoms may not be available.


Search criteria to identify outcomes should be defined, and the list of codes and any algorithm used should be provided. Generation of code lists requires expertise in both the coding system and the disease area. Researchers should consult clinicians who are familiar with the coding practice within the studied field. Suggested methodologies are available for some coding systems (see Creating medical and drug code lists to identify cases in primary care databases. Pharmacoepidemiol Drug Saf 2009;18(8):704-7). Coding systems used in some commonly used databases are updated regularly, so sustainability issues in prospective studies should be addressed at the protocol stage. Moreover, great care should be taken when re-using a code list from another study, as code lists depend on the study objective and methods. Public repositories of code lists are available, and researchers are also encouraged to make their own code lists available.

In some circumstances, chart review or free text entries in electronic format linked to coded entries can be useful for outcome identification. Such identification may involve an algorithm with use of multiple code lists (for example disease plus therapy codes) or an endpoint committee to adjudicate available information against a case definition. In some cases, initial plausibility checks or subsequent medical chart review will be necessary. When databases contain prescription data only, drug exposure may be used as a proxy for an outcome, or linkage to different databases is required.
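
An algorithm combining multiple code lists, as mentioned above, could look like the following minimal sketch. It treats a patient as a case only if a disease code is accompanied by a therapy code within a chosen interval. The codes, the 90-day interval and the function name are hypothetical and are not taken from any real coding system or validated algorithm.

```python
from datetime import date

# Hypothetical code lists; real lists require coding and clinical expertise.
DISEASE_CODES = {"DX001", "DX002"}
THERAPY_CODES = {"RX100"}

def identify_cases(records, max_gap_days=90):
    """Return {patient_id: date of first disease code} for identified cases.

    records: iterable of (patient_id, code, event_date) tuples.
    A patient qualifies if any therapy code is recorded within
    max_gap_days (before or after) of the first disease code.
    """
    disease_dates, therapy_dates = {}, {}
    for patient_id, code, event_date in records:
        if code in DISEASE_CODES:
            disease_dates.setdefault(patient_id, []).append(event_date)
        elif code in THERAPY_CODES:
            therapy_dates.setdefault(patient_id, []).append(event_date)

    cases = {}
    for patient_id, dates in disease_dates.items():
        first_dx = min(dates)  # repeat codes do not create new episodes
        supporting = therapy_dates.get(patient_id, [])
        if any(abs((d - first_dx).days) <= max_gap_days for d in supporting):
            cases[patient_id] = first_dx
    return cases

records = [
    ("p1", "DX001", date(2020, 1, 10)),
    ("p1", "RX100", date(2020, 2, 1)),
    ("p1", "DX001", date(2020, 5, 1)),   # repeat code, same episode here
    ("p2", "DX002", date(2020, 1, 1)),   # disease code without therapy
    ("p3", "RX100", date(2020, 1, 1)),   # therapy code without disease
]
print(identify_cases(records))
```

Here only the first disease code defines the event date, so repeat codes are not counted as second episodes. A study interested in exacerbations or recurrences would need a different rule.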

5.1.3. Assessment of covariates


In pharmacoepidemiology studies, covariates are used for selecting and matching study subjects, comparing characteristics of the cohorts, developing propensity scores, creating stratification variables, evaluating effect modifiers and adjusting for confounders. Reliable assessment of covariates is therefore essential for the validity of results. Patient characteristics and other key covariates that could be confounding variables need to be evaluated using all available data. A given database may or may not be suitable for studying a research question depending on the availability of information on these covariates.


Some patient characteristics and covariates vary with time, and accurate assessment is therefore time dependent. The timing of assessment of the covariates is an important factor for the correct classification of the subjects and should be clearly specified in the protocol. Covariates can be captured at one or multiple points during the study period. In the latter scenario, the variable will be modelled as a time-dependent variable.

Assessment of covariates can be done using different periods of time (look-back periods or run-in periods). Fixed look-back periods (for example 6 months or 1 year) are sometimes used when there are changes in coding methods or in practices, or when using the entire medical history of a patient is not feasible. Estimation using all available covariates information versus a fixed look-back window for dichotomous covariates (Pharmacoepidemiol Drug Saf 2013; 22(5):542-50) reports that defining covariates based on all available historical data, rather than on data observed over a commonly shared fixed historical window, results in estimates with less bias. However, this approach may not always be applicable, for example when data from paediatric and adult periods are combined, because covariates may differ significantly between paediatric and adult populations (e.g. height and weight).
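
The difference between a fixed look-back window and the use of all available history can be sketched as follows for a dichotomous covariate. The function name, the event layout and the code "DIAB" are hypothetical, and the sketch ignores left truncation of the patient's record, which matters in real data.

```python
from datetime import date

def covariate_present(events, index_date, codes, lookback_days=None):
    """Return True if any code in `codes` is recorded before index_date.

    events: iterable of (code, event_date) tuples for one patient.
    lookback_days: length of a fixed look-back window in days, or None
    to use all available history before the index date.
    """
    for code, event_date in events:
        if code not in codes or event_date >= index_date:
            continue  # irrelevant code, or recorded on/after the index date
        days_before = (index_date - event_date).days
        if lookback_days is None or days_before <= lookback_days:
            return True
    return False

history = [("DIAB", date(2015, 6, 1))]
index = date(2020, 1, 1)

# A one-year window misses a diagnosis recorded more than four years earlier.
print(covariate_present(history, index, {"DIAB"}, lookback_days=365))
# Using all available history captures it.
print(covariate_present(history, index, {"DIAB"}))
```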

5.1.4. Validation


In healthcare databases, the correct assessment of drug exposure, outcomes and covariates is crucial to avoid misclassification. Validity of diagnostic coding within the General Practice Research Database: a systematic review (Br J Gen Pract 2010;60:e128-36), the book Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy. 5th Edition, Wiley, 2012) and Mini-Sentinel's systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned (Pharmacoepidemiol Drug Saf 2012; Suppl 1:82-9) provide examples of validation.


Potential misclassification of exposure, outcome and other variables should be measured and removed or reduced. Misclassification by exposure should be measured by validating each comparison group. External validation against chart review or physician/patient questionnaire is possible in some instances, but the questionnaires cannot always be considered as a ‘gold standard’. While the positive predictive value is more easily measured than the negative predictive value, a low specificity is more damaging than a low sensitivity when considering bias in relative risk estimates (see A review of uses of health care utilization databases for epidemiologic research on therapeutics, J Clin Epidemiol 2005;58(4):323-37). When validation of the variable is complete, the study point estimate should be adjusted accordingly (see Use of the Positive Predictive Value to Correct for Disease Misclassification in Epidemiologic Studies, Am J Epidemiol 1993;138(11):1007–15 and Sentinel Quantitative Bias Analysis Methodology Development: Sequential Bias Adjustment for Outcome Misclassification, 2017).
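
The following minimal sketch illustrates why specificity matters more than sensitivity for relative risk estimates, and how an observed risk can be back-corrected once sensitivity and specificity are known. It uses the textbook relationship between true and observed risk under outcome misclassification. It is not the PPV-based method of the papers cited above, and all function names and numbers are invented for illustration.

```python
def validation_metrics(tp, fp, fn, tn):
    """Validation measures from a 2x2 table against a reference standard."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def observed_risk(true_risk, sensitivity, specificity):
    """Risk that would be observed given imperfect outcome classification."""
    true_positives = sensitivity * true_risk
    false_positives = (1 - specificity) * (1 - true_risk)
    return true_positives + false_positives

def corrected_risk(obs_risk, sensitivity, specificity):
    """Invert observed_risk to recover the true risk (simple bias analysis)."""
    denom = sensitivity + specificity - 1
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    return (obs_risk - (1 - specificity)) / denom

# Invented true risks: 2% in the exposed, 1% in the unexposed (true RR = 2).
risk_exposed, risk_unexposed = 0.02, 0.01

# Low sensitivity with perfect specificity leaves the risk ratio unchanged.
rr_low_sens = (observed_risk(risk_exposed, 0.8, 1.0)
               / observed_risk(risk_unexposed, 0.8, 1.0))

# Perfect sensitivity with 95% specificity pulls the ratio towards 1.
rr_low_spec = (observed_risk(risk_exposed, 1.0, 0.95)
               / observed_risk(risk_unexposed, 1.0, 0.95))

print(round(rr_low_sens, 2), round(rr_low_spec, 2))
```

With these invented numbers, the risk ratio stays at 2 when only sensitivity is imperfect, but falls to about 1.16 when specificity is 95%, because false positives among the many non-cases dilute both groups. The sketch assumes the same sensitivity and specificity in both comparison groups (non-differential misclassification); where validation shows they differ, each group needs its own values.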


For databases routinely used in research, documented validation of key variables may have been done previously by the data provider or other researchers. Any extrapolation of a previous validation study should however consider the effect of any differences in prevalence and inclusion and exclusion criteria, the distribution and analysis of risk factors as well as subsequent changes to health care, procedures and coding, as illustrated in Basic Methods for Sensitivity Analysis of Biases (Int J Epidemiol 1996; 25(6): 1107-16). An accurate date of onset is particularly important for studies relying upon timing of exposure and outcome, such as self-controlled designs. A comparison of data from registries with clinical or administrative records can also validate individual records on a specific outcome.


Linkage validation can be used when another database is used for the validation through linkage methods (see Using linked electronic data to validate algorithms for health outcomes in administrative databases, J Comp Eff Res 2015; 4:359-66). In some situations, there is no access to a resource to provide data for comparison. In this case, indirect validation may be an option, as explained in the book Applying quantitative bias analysis to epidemiologic data (Lash T, Fox MP, Fink AK. Springer-Verlag, New York, 2009).


Structural validation of the database with internal logic checks can also be performed to verify the completeness and accuracy of variables. For example, one can investigate whether an outcome was followed (or preceded) by appropriate exposure or procedures, or whether a certain variable has values within a known reasonable range.
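
Internal logic checks of this kind can be sketched as a function that flags implausible records. The field names, the plausible weight range and the 30-day interval for a supporting procedure are assumptions chosen for illustration; appropriate checks depend on the database and the outcome.

```python
from datetime import date

def check_record(rec, weight_range_kg=(0.3, 400.0), max_procedure_gap_days=30):
    """Return a list of flags raised by simple internal logic checks.

    rec: dict with optional keys 'weight_kg', 'outcome_date',
    'procedure_date' and 'death_date' (hypothetical field names).
    """
    flags = []

    # Range check: is the value within a known reasonable range?
    low, high = weight_range_kg
    weight = rec.get("weight_kg")
    if weight is not None and not (low <= weight <= high):
        flags.append("weight_out_of_range")

    # Sequence check: is the outcome followed by an appropriate procedure?
    outcome = rec.get("outcome_date")
    procedure = rec.get("procedure_date")
    if outcome is not None:
        if procedure is None:
            flags.append("no_supporting_procedure")
        elif not (0 <= (procedure - outcome).days <= max_procedure_gap_days):
            flags.append("no_supporting_procedure")

    # Consistency check: an outcome should not be dated after death.
    death = rec.get("death_date")
    if outcome is not None and death is not None and outcome > death:
        flags.append("outcome_after_death")

    return flags

plausible = {
    "weight_kg": 72.5,
    "outcome_date": date(2020, 3, 1),
    "procedure_date": date(2020, 3, 3),
}
implausible = {
    "weight_kg": 725.0,                  # likely a misplaced decimal point
    "outcome_date": date(2020, 3, 1),
    "death_date": date(2020, 2, 1),
}
print(check_record(plausible))
print(check_record(implausible))
```

A flag does not prove that a record is wrong. It identifies records for review and gives a rough measure of data completeness and accuracy.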



