A growing number of pharmacoepidemiological studies use data from networks of databases, often from different countries. Pooling data across different databases affords insight into the generalisability of the results and may improve precision. Some of these networks are based on long-term contracts with selected partners and are very well structured (such as Sentinel, the Vaccine Safety Datalink (VSD), or the Canadian Network for Observational Drug Effect Studies (CNODES)), while others are looser collaborations based on an open community principle such as Observational Health Data Sciences and Informatics (OHDSI).
In Europe, collaborations for multi-database studies have been strongly encouraged by the drug safety research funded by the European Commission (EC) and public-private partnerships such as the Innovative Medicines Initiative (IMI). This funding supported the groundwork necessary to overcome the hurdles of data sharing across countries for specific projects (e.g. PROTECT, ADVANCE, EMIF, EHDEN, ConcePTION) and for specific post-authorisation studies. With the recent ambition of the European Commission to strive towards a so-called European Health Data Space (EHDS), major breakthroughs in this field are expected. The Joint Action Towards the European Health Data Space (TEHDAS) project develops joint European principles for the secondary use of health data. In addition, the DARWIN EU initiative started in 2022 as a federated network designed to support scientific evaluations and regulatory decision-making, in collaboration with, and acting as a pathfinder to, the EHDS.
The 2009 H1N1 influenza pandemic (see Safety monitoring of Influenza A/H1N1 pandemic vaccines in EudraVigilance, Vaccine 2011;29(26):4378-87) and, more recently, the 2020 COVID-19 pandemic showed the importance of a formally established infrastructure that can rapidly and effectively monitor the safety of therapeutics and vaccines. In this context, EMA has established contracts with academic and private partners to support readiness of research networks to perform observational research. Three dedicated projects started in 2020: ACCESS (vACcine Covid-19 monitoring readinESS), CONSIGN (COVID-19 infectiOn aNd medicineS In preGNancy) and E-CORE (Evidence for COVID-19 Observational Research Europe). Other initiatives have emerged to address specific COVID-19 related research questions, such as the CVD-COVID-UK consortium (Linked electronic health records for research on a nationwide cohort of more than 54 million people in England: data resource, BMJ. 2021;373:n826), which provides secure access to linked health data from primary and secondary care, registered deaths, COVID-19 laboratory and vaccination data, and cardiovascular specialist audits, covering almost the entire population of England (>54 million people); similar linked data have been made available in trusted research environments for Scotland and Wales (>8 million people).
In this chapter, the term networking is used to reflect collaboration between researchers for sharing expertise and resources. The ENCePP Database of Research Resources may facilitate such networking by providing an inventory of research centres and data sources that collaborate on specific pharmacoepidemiology and pharmacovigilance studies in Europe. It allows the identification of research centres and data sources by country, study, type of research and other relevant fields.
The use of research networks in drug safety, drug utilisation and disease epidemiology is well established. A significant body of practical experience exists, while the use in effectiveness research is becoming more established (Assessing strength of evidence for regulatory decision making in licensing: What proof do we need for observational studies of effectiveness?, Pharmacoepidemiol. Drug Saf. 2020;29(10):1336-40).
From a methodological point of view, studies that adopt a multi-database design have many advantages over single database studies:
Increase the size of the study populations. This especially facilitates research on rare events, on drugs used in specialised settings (see Ability of primary care health databases to assess medicinal products discussed by the European Union Pharmacovigilance Risk Assessment Committee, Clin Pharmacol Ther. 2020;107(4):957-65), or when the interest is in subgroup effects.
Exploit the heterogeneity of treatment options across countries, which allows studying the effect of different drugs used for the same indication or specific patterns of utilisation.
Exploit differences in outcome/event rates across countries/regions.
Provide additional knowledge on the generalisability of results and on the consistency of association, for instance whether a safety issue can be identified in several countries. Inconsistencies may be caused by different biases or by truly different effects in the databases; investigating them may reveal causes of differential drug effects.
Involve experts from various countries in addressing case definitions, terminologies, coding in databases and research practices, which provides opportunities to increase the consistency of results of observational studies.
In case of primary data collection, shorten the time needed for obtaining the desired sample size and therefore accelerate investigation of drug safety issues or other outcomes.
The articles Approaches for combining primary care electronic health record data from multiple sources: a systematic review of observational studies (BMJ Open 2020;10(10): e037405) and Different strategies to execute multi-database studies for medicines surveillance in real world setting: a reflection on the European model (Clin Pharmacol Ther. 2020;108(2):228-35) describe key characteristics of studies using multiple data sources and different models applied for combining data or results from multiple databases. A common characteristic of all models is the fact that data partners maintain physical and operational control over electronic data in their existing environment and therefore the data extraction is always done locally. Differences, however, exist in the following areas: use of a common protocol; use of a common data model (CDM); and where and how the data analysis is conducted.
Use of a CDM implies that local formats are translated into a predefined, common data structure, which allows launching a similar data extraction and analysis script across several databases. Sometimes the CDM imposes a common terminology as well, as in the case of the OMOP CDM. The CDM can be systematically applied on the entire database (generalised CDM) or on the subset of data needed for a specific study (study specific CDM). The CDM transformation is assumed to faithfully represent the source data in terms of both completeness and accuracy. Validation studies such as Can We Rely on Results From IQVIA Medical Research Data UK Converted to the Observational Medical Outcome Partnership Common Data Model? A Validation Study Based on Prescribing Codeine in Children (Clin Pharmacol Ther. 2020;107(4): 915-25) are recommended, and any deviations found should be carefully monitored and recorded.
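The idea of translating local formats into a common structure can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two source layouts (db_a, db_b), their field names, the local product code COD30 and the DrugExposure record are hypothetical and do not correspond to any CDM named in this chapter; ATC is used as the common terminology only as an example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DrugExposure:
    """One row of a hypothetical study-specific common data model."""
    person_id: str
    drug_code: str   # common terminology (ATC is assumed here)
    start_date: date
    source_db: str


# Hypothetical lookup from local codes to the common terminology.
# A real mapping would be curated and validated against source data.
LOCAL_TO_ATC = {
    ("db_a", "COD30"): "R05DA04",              # ATC code for codeine
    ("db_b", "codeine phosphate"): "R05DA04",
}


def from_db_a(row: dict) -> DrugExposure:
    # db_a (hypothetical) stores ISO dates and a local product code.
    return DrugExposure(
        person_id=f"db_a:{row['patid']}",
        drug_code=LOCAL_TO_ATC[("db_a", row["prodcode"])],
        start_date=date.fromisoformat(row["issue_date"]),
        source_db="db_a",
    )


def from_db_b(row: dict) -> DrugExposure:
    # db_b (hypothetical) stores day/month/year strings and drug names.
    day, month, year = (int(part) for part in row["dispensed"].split("/"))
    return DrugExposure(
        person_id=f"db_b:{row['id']}",
        drug_code=LOCAL_TO_ATC[("db_b", row["drug_name"].strip().lower())],
        start_date=date(year, month, day),
        source_db="db_b",
    )


def count_exposed(rows, drug_code: str) -> int:
    """Common analysis script: runs unchanged on any CDM-formatted rows."""
    return len({r.person_id for r in rows if r.drug_code == drug_code})
```

Once both sources are expressed as DrugExposure rows, the same count_exposed function can be applied to either of them, which is the property that makes a common script possible. The site-specific effort sits entirely in the two transformation functions and the mapping table, and these are what a validation study would scrutinise.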
In the European Union, study specific CDMs have generated results in several projects, and several databases have been converted to a generalised CDM version that exists alongside the native version. Conversion accelerated during the last year, partly owing to the role that observational research played in informing the response to the COVID-19 pandemic. Examples of the application of generalised CDMs are the studies conducted in the OHDSI community, such as Association of angiotensin converting enzyme (ACE) inhibitors and angiotensin 2 receptor blockers (ARB) on COVID-19 incidence and complications, or the ConcePTION studies (see From Inception to ConcePTION: Genesis of a Network to Support Better Monitoring and Communication of Medication Safety During Pregnancy and Breastfeeding, Clin Pharmacol Ther. 2022;111(1):321-31).
Five models of studies are presented, classified according to specific choices in the steps needed to execute a study: protocol development and agreement (whether separate or common); where the data are extracted and analysed (locally or centrally); how the data are extracted and analysed (using individual or common programs); and use of a CDM and which type (study specific or general). The key characteristics of the steps needed to execute each study model are presented in the following Figure and explained in this section.
In the traditional model for combining data from multiple data sources, data extraction and analysis are performed independently at each centre based on separate protocols. This is usually followed by meta-analysis of the different estimates obtained (see Chapter 9 and Annex 1).
This type of model, when viewed as a prospective decision to combine results from multiple data sources on the same topic, may be considered as a baseline situation which a research network will try to improve. Moreover, since meta-analyses facilitate the evaluation of heterogeneity of results across different independent studies, they should be used retrospectively regardless of the model of studies used. If all the data sources can be accessed, explaining such variation should also be attempted.
This is consistent with the recommendations from Multi-centre, multi-database studies with common protocols: lessons learnt from the IMI PROTECT project (Pharmacoepidemiol Drug Saf. 2016;25(S1):156-165), which state that investigating heterogeneity may provide useful information on the issue under investigation. This approach eventually increases consistency in findings from observational drug effect studies or reveals causes of differential drug effects.
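As an illustration of how database-specific estimates can be combined and their heterogeneity quantified, the following is a minimal sketch of a fixed-effect, inverse-variance meta-analysis with Cochran's Q and the I² statistic. It assumes each database contributes a log effect estimate (for example a log hazard ratio) with its standard error; the function name and input layout are hypothetical, and a random-effects model may be more appropriate when heterogeneity is substantial.

```python
import math


def pool_fixed_effect(estimates):
    """Fixed-effect inverse-variance pooling.

    estimates: list of (log_effect, standard_error), one per database.
    Returns (pooled_log_effect, pooled_se, cochran_q, i_squared).
    """
    # Each database is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, (b, _) in zip(weights, estimates))
    df = len(estimates) - 1

    # I^2: share of total variation attributed to between-database
    # heterogeneity rather than chance, truncated at zero.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared
```

A high I² signals that the databases disagree by more than sampling error would explain, which is the situation in which differences in populations, coding or design across data sources are worth investigating.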
In this model, data are extracted and analysed locally, with site-specific programs that are developed by each centre, on the basis of a common protocol agreed by study partners that defines and standardises exposures, outcomes and covariates, analytical programmes and reporting formats. The results of each analysis, either at the subject level or in an aggregated format depending on the governance of the network, are shared and can be pooled together using meta-analysis.
This approach allows the assessment of database or population characteristics and their impact on estimates, while reducing the variability of results caused by differences in design. An example of a research network that uses the common protocol approach is PROTECT (as described in Improving Consistency and Understanding of Discrepancies of Findings from Pharmacoepidemiological Studies: the IMI PROTECT Project, Pharmacoepidemiol Drug Saf. 2016;25(S1): 1-165), which has implemented this approach in collaboration with CNODES (Major bleeding in users of direct oral anticoagulants in atrial fibrillation: A pooled analysis of results from multiple population-based cohort studies, Pharmacoepidemiol Drug Saf. 2021 Oct;30(10):1339-52).
This approach requires very detailed common protocols and data specifications that reduce variability in interpretation by researchers.
In this approach, a common protocol is agreed by the study partners. Data intended to be used for the study are locally extracted with site-specific programs, transferred without analysis or conversion to a CDM, and pooled and analysed at the central partner receiving them. Data received at the central partner can be reformatted to a common structure to facilitate the analysis.
This approach is suitable when databases are very similar in structure and content, as is the case for some Nordic registries or the Italian regional databases. Examples of studies using this model are Risks and benefits of psychotropic medication in pregnancy: cohort studies based on UK electronic primary care health records (Health Technol Assess. 2016;20(23):1–176) and All‐cause mortality and antipsychotic use among elderly persons with high baseline cardiovascular and cerebrovascular risk: a multi‐center retrospective cohort study in Italy (Expert Opin. Drug Metab. Toxicol. 2019;15(2):179-88).
The central analysis allows for assessment of pooled data adjusting for covariates on an individual patient level and removes an additional source of variability linked to the statistical programming and analysis. However, this model is becoming more difficult to implement, especially in Europe, due to the stronger privacy requirements when sharing patient level data.
In this approach, a common protocol is agreed by the study partners. Data intended to be used for the study are locally extracted and transformed into an agreed CDM; data in the CDM are then processed locally in all the sites with one common program. The output of the common program is transferred to a specific partner. The output to be shared may be an analytical dataset or study estimates, depending on the governance of the network.
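The division of labour in this model can be sketched as follows, for the case where the network's governance only allows aggregated output to be shared. The row layout (exposed, event, follow_up_years) and all function names are hypothetical. The crude pooled rate ratio is used purely to keep the example short; a real analysis would normally stratify by database or meta-analyse database-specific estimates.

```python
from collections import defaultdict


def common_program(cdm_rows):
    """Runs unchanged at every site on locally held CDM rows.

    Only aggregate counts per exposure group leave the site.
    """
    agg = defaultdict(lambda: {"events": 0, "person_years": 0.0})
    for row in cdm_rows:
        cell = agg[row["exposed"]]
        cell["events"] += row["event"]
        cell["person_years"] += row["follow_up_years"]
    return {group: dict(cell) for group, cell in agg.items()}


def combine(site_outputs):
    """Runs at the partner receiving the outputs of all sites."""
    total = defaultdict(lambda: {"events": 0, "person_years": 0.0})
    for output in site_outputs:
        for group, cell in output.items():
            total[group]["events"] += cell["events"]
            total[group]["person_years"] += cell["person_years"]
    return {group: dict(cell) for group, cell in total.items()}


def crude_rate_ratio(total):
    """Incidence rate in the exposed divided by that in the unexposed."""
    def rate(group):
        return total[group]["events"] / total[group]["person_years"]
    return rate(True) / rate(False)
```

Because every site executes the same common_program, differences between site outputs can be attributed to the data rather than to local programming choices.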
Examples of research networks that used this approach by employing a study-specific CDM with transmission of anonymised patient-level data (allowing a detailed characterisation of each database) are EU-ADR (as explained in Combining multiple healthcare databases for postmarketing drug and vaccine safety surveillance: why and how?, J Intern Med 2014;275(6):551-61), SOS, ARITMO, SAFEGUARD, GRIP, EMIF, EUROmediCAT, ADVANCE, VAC4EU and ConcePTION. In all these projects, a CDM was utilised and R, SAS, Stata or Jerboa scripts were used to create and share common analytics. Diagnosis codes for case finding can be mapped across terminologies by using CodeMapper, developed in the ADVANCE project, as explained in CodeMapper: semiautomatic coding of case definitions (Pharmacoepidemiol Drug Saf. 2017;26(8):998-1005).
An example of a study performed using this model is Background rates of Adverse Events of Special Interest for monitoring COVID-19 vaccines, an ACCESS study.
In this approach, the local databases are transformed into a CDM prior to and independently of any study protocol. When a study is required, a common protocol is developed and an analysis program is created centrally and run locally on each database to extract and analyse the data. The output of the common program to be shared may be an analytical dataset or study estimates, depending on the governance of the network.
Three examples of research networks which use a generalised CDM are the Sentinel Initiative (as described in The U.S. Food and Drug Administration's Mini-Sentinel Program, Pharmacoepidemiol Drug Saf 2012;21(S1):1–303), OHDSI – Observational Health Data Sciences and Informatics and the Canadian Network for Observational Drug Effect Studies (CNODES). The latter previously relied on the second model described in this chapter but has since moved to a CDM, with six provinces having already completed the transformation of their data, as explained in Building a framework for the evaluation of knowledge translation for the Canadian Network for Observational Drug Effect Studies (Pharmacoepidemiol. Drug Saf. 2020;29 (S1),8-25).
The main advantage of a general CDM is that it can be used for virtually any study involving that database. OHDSI is based on the Observational Medical Outcomes Partnership (OMOP) CDM which is now used by many organisations and has been tested for its suitability for safety studies (see for example, Validation of a common data model for active safety surveillance research, J Am Med Inform Assoc. 2012;19(1):54–60, and Can We Rely on Results From IQVIA Medical Research Data UK Converted to the Observational Medical Outcome Partnership Common Data Model?: A Validation Study Based on Prescribing Codeine in Children, Clin Pharmacol Ther. 2020;107(4):915-25). Conversion into the OMOP CDM requires formal mapping of database items to standardised concepts. This is resource intensive and will need to be updated every time the databases are refreshed. Examples of studies performed with the OMOP CDM in Europe are Large-scale evidence generation and evaluation across a network of databases (LEGEND): assessing validity using hypertension as a case study (J Am Med Inform Assoc. 2020;27(8):1268-77) and Safety of hydroxychloroquine, alone and in combination with azithromycin, in light of rapid wide-spread use for COVID-19: a multinational, network cohort and self-controlled case series study (Lancet Rheumatol. 2020;2: e698–711).
In A Comparative Assessment of Observational Medical Outcomes Partnership and Mini-Sentinel Common Data Models and Analytics: Implications for Active Drug Safety Surveillance (Drug Saf. 2015;38(8):749-65), it is suggested that slight conceptual differences between the Sentinel and the OMOP models do not significantly impact on identifying known safety associations. Differences in risk estimations can be primarily attributed to the choices and implementation of the analytic approach.
A future development that has been investigated and could be applied across all models is federated learning. Federated learning is a machine learning technique that trains an algorithm across multiple independent data sources, without exchanging patient-level data. This approach stands in contrast to traditional centralised machine learning techniques where all the local datasets are uploaded to one server. Federated learning enables multiple actors to build a common, robust machine learning model without sharing data, thus helping to address critical issues such as data privacy, data security, data access rights and access to distributed data. Although federated learning is promising, challenges remain, as discussed in The future of digital health with federated learning (NPJ Digit Med. 2020;14;3:119).
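The principle can be illustrated with a minimal sketch of federated averaging, one common way of implementing federated learning; it is not taken from the cited article. The sketch assumes a simple linear model y ≈ w0 + w1·x fitted by gradient descent, and all names, the learning rate and the number of rounds are arbitrary choices for illustration. Only model weights and sample counts leave each site; the (x, y) records never do.

```python
def local_update(weights, data, lr=0.1, epochs=5):
    """One site's training step on its own data.

    Runs a few epochs of gradient descent on mean squared error and
    returns the updated weights together with the local sample count.
    """
    w0, w1 = weights
    n = len(data)
    for _ in range(epochs):
        g0 = sum((w0 + w1 * x - y) for x, y in data) / n
        g1 = sum((w0 + w1 * x - y) * x for x, y in data) / n
        w0, w1 = w0 - lr * g0, w1 - lr * g1
    return (w0, w1), n


def federated_average(updates):
    """Coordinator step: average site weights, weighted by sample size."""
    total = sum(n for _, n in updates)
    return tuple(
        sum(w[i] * n for w, n in updates) / total for i in range(2)
    )


def train(sites, rounds=200):
    """Alternate local training and central averaging for several rounds."""
    weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(weights, data) for data in sites]
        weights = federated_average(updates)
    return weights
```

In this toy setting the averaged model converges towards the one that would be obtained from the pooled data. In practice, differences in data distributions across sites and the residual privacy risk carried by shared model updates are among the challenges that remain.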
The different models described above present several challenges:
Related to the databases content:
Differences in the underlying health care systems;
Different mechanisms of data generation and collection;
Mapping of different drug and disease dictionaries (e.g., the International Classification of Diseases, 10th Revision (ICD-10), Read codes, the International Classification of Primary Care (ICPC-2));
Free text medical notes in different languages;
Differences in the validation of study variables and access to source documents for validation;
Differences in the type and quality of information contained within each database.
Related to the organisation of the network:
Different ethical and governance requirements in each country regarding processing of anonymised or pseudonymised healthcare data;
Issues linked to intellectual property and authorship;
Implementing quality control procedures at each partner and across the entire network;
Sustainability and funding mechanisms;
The networks tend to become very topic specific over time and to become isolated in ‘silos’.
Each model has strengths and weaknesses in facing the above challenges, as illustrated in Data Extraction and Management in Networks of Observational Health Care Databases for Scientific Research: A Comparison of EU-ADR, OMOP, Mini-Sentinel and MATRICE Strategies (eGEMs 2016;4(1):2). In particular, a central analysis or a CDM provides protection against variation in how protocols are implemented, as individual analysts might implement the same protocol differently (as described in Quantifying how small variations in design elements affect risk in an incident cohort study in claims; Pharmacoepidemiol Drug Saf. 2020;29(1):84-93). Experience has shown that many of these difficulties can be overcome by full involvement of, and good communication between, partners, and by a clear governance model that defines roles and responsibilities and addresses issues of intellectual property and authorship. Several networks, such as OHDSI, Sentinel and ADVANCE/VAC4EU, have made their code, products, data models and analytics software publicly available. Timeliness in running studies is important in order to meet short regulatory timelines in circumstances where prompt decision-making is needed. Solutions therefore need to be further developed and introduced so that multi-database studies can be run with shorter timelines. Independently of the model used, a critical factor in speeding up studies is having tasks that are independent of any particular study completed in advance. This includes all activities associated with governance, such as having prespecified agreements on data access, processes for protocol development and study management, and identification and characterisation of a large set of databases.
This also includes some activities related to the analysis, such as creating common definitions for frequently used variables, and creating common analytical systems for the most typical and routine analyses (this latter point is made easier with the use of CDMs with standardised analytics and tools that can be re-used to support faster analysis).