Large electronic data sources such as electronic health care records, insurance claims data and administrative data have opened up new opportunities for investigators to rapidly conduct pharmacoepidemiological studies and clinical trials in real-world health care settings and with a large number of subjects. A concern is that these data have not been collected systematically for research on the utilisation, safety or effectiveness of medicinal products, which could affect the validity, reliability and reproducibility of the investigation. Several data quality frameworks have been developed to help understand the strengths and limitations of the data for answering a research question, the impact these may have on the study results, and the decisions to be taken to complement the available data. The dimensions covered by these frameworks overlap, although different terms are sometimes used for the same dimension and the level of detail varies. Quality Control Systems for Secondary Use Data (2022) lists the domains addressed in several of them.
The European Health Data Space Data Quality Framework (2022) of the Towards European Health Data Space (TEHDAS) project has defined six dimensions deemed the most important at data source level: reliability, relevance, timeliness, coherence, coverage and completeness. Kahn’s A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data (eGEMs. 2016;4(1):1244) describes a framework with three data quality categories: Conformance (with sub-categories of Value, Relational and Computational Conformance), Completeness, and Plausibility (with sub-categories of Uniqueness, Atemporal and Temporal Plausibility). These categories are applied in two contexts: Verification and Validation. This framework is used by the US National Patient-Centered Clinical Research Network (PCORnet), which adds a further component, Persistence, and by the Observational Health Data Science and Informatics (OHDSI) network. Based on this framework, the Data Analytics chapter of the Book of OHDSI (2021) describes an automated tool that performs data quality checks in databases conforming to the OMOP common data model. Increasing Trust in Real-World Evidence Through Evaluation of Observational Data Quality (J Am Med Inform Assoc. 2021;28(10):2251-7) describes an open source R package that executes and summarises over 3,300 data quality checks in databases conforming to the OMOP common data model.
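As an illustration of how categories such as Completeness, Conformance and Plausibility may translate into automated checks, the following minimal Python sketch computes the proportion of records that violate one check per category. It is not the OHDSI tool or the R package cited above: the records, field names, allowed value set and 5% threshold are all hypothetical and chosen only for the example.

```python
from datetime import date

# Hypothetical toy records; field names are illustrative, not OMOP columns.
records = [
    {"person_id": 1, "sex": "F", "birth_date": date(1980, 5, 1), "dispensing_date": date(2020, 3, 2)},
    {"person_id": 2, "sex": "X", "birth_date": date(1975, 1, 9), "dispensing_date": date(2019, 7, 15)},
    {"person_id": 3, "sex": "M", "birth_date": None, "dispensing_date": date(2021, 1, 20)},
    {"person_id": 4, "sex": "M", "birth_date": date(2022, 6, 30), "dispensing_date": date(2021, 2, 11)},
]

ALLOWED_SEX = {"F", "M"}  # assumed value set for this example
THRESHOLD = 0.05          # assumed maximum tolerated share of violating rows


def failure_rate(rows, fails):
    """Return the share of rows for which the predicate `fails` is True."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if fails(row)) / len(rows)


def run_checks(rows):
    """Run one illustrative check per data quality category."""
    return {
        # Completeness: is a value present?
        "completeness:birth_date": failure_rate(
            rows, lambda r: r["birth_date"] is None
        ),
        # Value conformance: does the value belong to the allowed set?
        "conformance:sex": failure_rate(
            rows, lambda r: r["sex"] not in ALLOWED_SEX
        ),
        # Temporal plausibility: a dispensing cannot precede the date of birth.
        "plausibility:dispensing_after_birth": failure_rate(
            rows,
            lambda r: r["birth_date"] is not None
            and r["dispensing_date"] < r["birth_date"],
        ),
    }


results = run_checks(records)
for name, rate in results.items():
    status = "FAIL" if rate > THRESHOLD else "PASS"
    print(f"{name}: {rate:.0%} of rows violate the check -> {status}")
```

In practice, thresholds would be set per check and per study context, and a failed check would prompt investigation of the data source rather than automatic exclusion.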
Duke-Margolis Center’s Characterizing RWD Quality and Relevancy for Regulatory Purposes (2018) and Determining Real-World Data’s Fitness for Use and the Role of Reliability (2019) specify that determining whether a real-world dataset is fit for regulatory purpose is a contextual exercise, as a data source that is appropriate for one purpose may not be suitable for other evaluations. A real-world dataset should be evaluated as fit-for-purpose if, within the given clinical and regulatory context, it satisfies two dimensions: Data Relevancy (including Availability of key data elements, Representativeness, Sufficient subjects and Longitudinality) and Data Reliability, which has two aspects: Data Quality (Validity, Plausibility, Consistency, Conformance and Completeness) and Accrual.
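The contextual nature of this assessment can be sketched as a simple checklist in which the same data source is judged separately for each research question. The Python sketch below is only an illustration under strong assumptions: the class, its field names and the reduction of each item to a yes/no judgement are hypothetical simplifications, not part of the Duke-Margolis documents.

```python
from dataclasses import dataclass, field

# Items listed under the two dimensions in the text above.
RELEVANCY = (
    "availability_of_key_data_elements",
    "representativeness",
    "sufficient_subjects",
    "longitudinality",
)
DATA_QUALITY = ("validity", "plausibility", "consistency", "conformance", "completeness")
RELIABILITY = DATA_QUALITY + ("accrual",)


@dataclass
class FitnessAssessment:
    """Hypothetical record of yes/no judgements for one research question."""
    research_question: str
    judgements: dict = field(default_factory=dict)  # item name -> bool

    def unmet(self):
        """List items judged unsatisfied; items without a judgement count as unmet."""
        return [
            item
            for item in RELEVANCY + RELIABILITY
            if not self.judgements.get(item, False)
        ]

    def fit_for_purpose(self):
        """Fit only if every relevancy and reliability item is satisfied."""
        return not self.unmet()


all_met = {item: True for item in RELEVANCY + RELIABILITY}

# The same data source, assessed for two different questions.
short_term = FitnessAssessment("acute adverse events within 30 days", dict(all_met))
long_term = FitnessAssessment(
    "ten-year cancer risk", {**all_met, "longitudinality": False}
)

print(short_term.fit_for_purpose())  # True
print(long_term.unmet())             # ['longitudinality']
```

A real assessment would document the evidence behind each judgement and would rarely be binary; the sketch only shows that fitness is a property of the pairing of data source and question, not of the data source alone.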
Real-World Data for Regulatory Decision Making: Challenges and Possible Solutions for Europe (Clin Pharmacol Ther. 2019;106(1):36-9) describes four criteria for acceptability of RWE for regulatory purposes: Derived from data source of demonstrated good quality, Valid (internal and external), Consistent and Adequate.
Data quality frameworks have been described for specific types of data sources and specific objectives. For example, the EMA’s Guideline on Registry-based studies (2021) describes four quality components for use of patient registries (mainly disease registries) for regulatory purposes: Consistency, Completeness, Accuracy and Timeliness. A roadmap to using historical controls in clinical trials – by Drug Information Association Adaptive Design Scientific Working Group (DIA-ADSWG) (Orphanet J Rare Dis. 2020;15:69) describes the main sources of RWD to be used as historical controls, with an Appendix providing guidance on factors to be evaluated in the assessment of the relevance of RWD sources and resultant analyses.
Algorithms have been proposed to identify fit-for-purpose data to address research questions. For example, The Structured Process to Identify Fit-For-Purpose Data: A Data Feasibility Assessment Framework (Clin Pharmacol Ther. 2022;111(1):122-34) aims to complement FDA’s framework for real-world evidence with a structured and detailed stepwise approach for the identification and feasibility assessment of candidate data sources for a specific study. Whilst such an approach should be recommended, the complexity of some of these algorithms may discourage their use in practice. Experience will show to what extent they can support the validity and transparency of study results and ultimately the level of confidence in the evidence provided. It is also acknowledged that many investigators simply use the data source(s) they have access to and are familiar with.