Large electronic data sources such as electronic healthcare records, insurance claims and other administrative data have opened up new opportunities for investigators to rapidly conduct pharmacoepidemiological studies and clinical trials in real-world settings, with a large number of subjects. A concern is that these data have not been collected systematically for research on the utilisation, safety or effectiveness of medicinal products, which could affect the validity, reliability and reproducibility of the investigation. Several data quality frameworks have been developed to help understand the strengths and limitations of the data for answering a research question, the impact these may have on the study results, and the decisions to be made to complement the available data. The dimensions covered by these frameworks overlap, with different levels of detail. Quality Control Systems for Secondary Use Data (2022) lists the domains addressed in several of them.
The following non-exhaustive list provides links to published data quality frameworks generally applicable to data sources, with a short description of their content.
The draft HMA-EMA Data Quality Framework for EU medicines regulation (2022) provides general considerations on data quality that are relevant for regulatory decision-making, definitions of data dimensions and sub-dimensions, as well as ideas for their characterisation and related metrics. It also analyses which data quality actions and metrics can be put in place in different scenarios and introduces a maturity model to drive the evolution of automation in support of data-driven regulatory decision-making. The proposed data dimensions include Reliability (with sub-dimensions of Precision, Accuracy and Plausibility), Extensiveness (with sub-dimensions of Completeness and Coverage), Coherence (with the sub-dimensions of formal, structural and semantic coherence, Uniqueness, Conformance and Validity), Timeliness and Relevance.
The European Health Data Space Data Quality Framework (2022) of the Joint Action Towards the European Health Data Space (TEHDAS) project has defined six dimensions deemed the most important ones at data source level: reliability, relevance, timeliness, coherence, coverage and completeness.
Kahn’s A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data (eGEMs. 2016;4(1):1244) describes a framework with three data quality categories: Conformance (with sub-categories of Value Conformance, Relational Conformance and Computational Conformance), Completeness, and Plausibility (with sub-categories of Uniqueness, Atemporal Plausibility and Temporal Plausibility). These categories are applied in two contexts: Verification and Validation. This framework is used by the US National Patient-Centered Clinical Research Network (PCORnet), with an additional component, Persistence, and by the Observational Health Data Sciences and Informatics (OHDSI) network. Based on this framework, the Data Quality chapter of the Book of OHDSI (2021) provides an automated tool performing the data quality checks in databases conforming to the OMOP common data model. Increasing Trust in Real-World Evidence Through Evaluation of Observational Data Quality (J Am Med Inform Assoc. 2021;28(10):2251-7) describes an open source R package that executes and summarises over 3,300 data quality checks in databases conforming to the OMOP common data model.
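To make these categories more concrete, the following minimal Python sketch applies one illustrative check per category to a toy dataset. It is not taken from Kahn's paper or from the OHDSI tools: the record layout, field names, allowed value set and plausibility thresholds are all assumptions chosen for illustration only.

```python
from datetime import date

# Hypothetical toy records; field names are illustrative and are NOT OMOP fields.
RECORDS = [
    {"id": 1, "sex": "F", "birth": date(1980, 5, 1),
     "first_visit": date(2020, 1, 10), "weight_kg": 64.0},
    {"id": 2, "sex": "X", "birth": date(1975, 3, 2),
     "first_visit": date(2019, 6, 5), "weight_kg": None},
    {"id": 3, "sex": "M", "birth": date(2021, 7, 9),
     "first_visit": date(2020, 2, 1), "weight_kg": 700.0},
    {"id": 3, "sex": "M", "birth": date(1990, 1, 1),
     "first_visit": date(2018, 4, 4), "weight_kg": 82.5},
]

ALLOWED_SEX = {"F", "M"}  # assumed value set for this example


def value_conformance(records):
    """Conformance: ids of records whose sex code is outside the allowed set."""
    return [r["id"] for r in records if r["sex"] not in ALLOWED_SEX]


def completeness(records, field):
    """Completeness: fraction of records with a non-missing value for `field`."""
    return sum(r[field] is not None for r in records) / len(records)


def uniqueness(records):
    """Plausibility (Uniqueness): ids that appear more than once."""
    seen, duplicated = set(), set()
    for r in records:
        if r["id"] in seen:
            duplicated.add(r["id"])
        seen.add(r["id"])
    return sorted(duplicated)


def atemporal_plausibility(records, low=0.5, high=400.0):
    """Plausibility (Atemporal): ids with a weight outside an assumed range."""
    return [
        r["id"] for r in records
        if r["weight_kg"] is not None and not (low <= r["weight_kg"] <= high)
    ]


def temporal_plausibility(records):
    """Plausibility (Temporal): ids whose first visit precedes the birth date."""
    return [r["id"] for r in records if r["first_visit"] < r["birth"]]


if __name__ == "__main__":
    print("Value conformance failures:", value_conformance(RECORDS))
    print("Completeness of weight_kg:", completeness(RECORDS, "weight_kg"))
    print("Duplicated ids:", uniqueness(RECORDS))
    print("Implausible weights:", atemporal_plausibility(RECORDS))
    print("Visits before birth:", temporal_plausibility(RECORDS))
```

All of these checks are internal to the dataset and so correspond to the Verification context; Validation would additionally require comparison against an external reference, which this sketch does not attempt.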
Duke-Margolis Center for Health Policy’s Characterizing RWD Quality and Relevancy for Regulatory Purposes (2018) and Determining Real-World Data’s Fitness for Use and the Role of Reliability (2019) specify that determining whether a real-world dataset is fit-for-regulatory-purpose is a contextual exercise, as a data source that is appropriate for one purpose may not be suitable for other evaluations. A real-world dataset should be considered fit-for-purpose if, within the given clinical and regulatory context, it fulfils two dimensions: Data Relevancy (including Availability of key data elements, Representativeness, Sufficient subjects and Longitudinality) and Data Reliability, which has two aspects: Data Quality (Validity, Plausibility, Consistency, Conformance and Completeness) and Accrual.
Data quality frameworks have been described for specific types of data sources and specific objectives. For example, the EMA’s Guideline on Registry-based studies (2021) describes four quality components for use of patient registries (mainly disease registries) for regulatory purposes: Consistency, Completeness, Accuracy and Timeliness. A roadmap to using historical controls in clinical trials – by Drug Information Association Adaptive Design Scientific Working Group (DIA-ADSWG) (Orphanet J Rare Dis. 2020;15:69) describes the main sources of RWD to be used as historical controls, with an Appendix providing guidance on factors to be evaluated in the assessment of the relevance of RWD sources and resultant analyses.
Algorithms have been proposed to identify fit-for-purpose data to address research questions. For example, The Structured Process to Identify Fit-For-Purpose Data: A Data Feasibility Assessment Framework (Clin Pharmacol Ther. 2022;111(1):122-34) and its update, A Structured Process to Identify Fit-for-Purpose Study Design and Data to Generate Valid and Transparent Real-World Evidence for Regulatory Uses (Clin Pharmacol Ther. 2023;113(6):1235-1239), aim to complement FDA’s framework for RWE with a structured and detailed stepwise approach for the identification and feasibility assessment of candidate data sources for a specific study. The update emphasises the importance of the initial study design, including designing a hypothetical target trial as a benchmark for the real-world study design before proceeding to the data feasibility assessment. Whilst data feasibility assessment should be recommended as an approach, the complexity of some of the algorithms may discourage their use in practice. Experience will show to what extent they can support the validity and transparency of study results and, ultimately, the level of confidence in the evidence provided. It is also acknowledged that many investigators simply use the data source(s) they have access to and whose potential for bias, confounding and missing data they are familiar with.