
ENCePP Guide on Methodological Standards in Pharmacoepidemiology


4. Approaches to data collection

There are two main approaches for data collection: collection of data specifically for a particular study (‘primary data collection’) or use of data already collected for another purpose, e.g. as part of administrative records of patient health care (‘secondary data collection’). The distinction between primary and secondary data collection is important for marketing authorisation holders as it implies different regulatory requirements for the collection and reporting of suspected adverse reactions, as described in Module VI of the Guideline on good pharmacovigilance practice (GVP) - Management and reporting of adverse reactions to medicinal products.


Secondary data collection has become the most common approach used in pharmacoepidemiology due to the increasing availability of electronic healthcare records, administrative claims data and other already existing data sources (see section 4.2 Secondary data collection). In addition, networking between centres active in pharmacoepidemiology and pharmacovigilance is rapidly changing the landscape of drug safety research in Europe, both in terms of networks of data and networks of researchers who can contribute to a particular study with a particular data source (see section 4.6 Research Networks).


4.1. Primary data collection


Collection of data specifically for a study has played an important role in pharmacoepidemiology, and methodological aspects of the conduct of primary data collection studies are well covered in the textbooks and guidelines referred to in the Introduction section. Annex 1 of Module VIII of the Good pharmacovigilance practice includes examples of study designs based on prospective primary data collection, such as the cross-sectional study, the prospective cohort study and active surveillance. Surveys and randomised controlled trials are presented below as examples of primary data collection.


Studies using hospital or community-based primary data collection have allowed the evaluation of drug-disease associations for rare complex conditions that require very large source populations and in-depth case assessment by clinical experts. Classic examples are Appetite-Suppressant Drugs and the Risk of Primary Pulmonary Hypertension (N Engl J Med 1996;335:609-16), The design of a study of the drug etiology of agranulocytosis and aplastic anemia (Eur J Clin Pharmacol 1983;24:833-6) and Medication Use and the Risk of Stevens–Johnson Syndrome or Toxic Epidermal Necrolysis (N Engl J Med 1995;333:1600-8). For some conditions, case-control surveillance networks have been developed and used for selected studies and for signal generation and clarification, e.g. Signal generation and clarification: use of case-control data (Pharmacoepidemiol Drug Saf 2001;10:197-203).


4.1.1. Surveys


A survey is a data collection tool used to gather information about individuals. Surveys are commonly used to collect self-reported data, either factual information about individuals or their opinions. They generally have a cross-sectional design and represent a form of primary data collection conducted through questionnaires administered by web, phone or paper.


Although long used in other areas such as social science and marketing, surveys are nowadays increasingly used in pharmacoepidemiology, especially in epidemiology and in the evaluation of the effectiveness of risk minimisation measures (RMM).


Questionnaires used in surveys should be validated based on accepted measures including construct, criterion and content validity, inter-rater and test-retest reliability, sensitivity and responsiveness.

Recommendations with regards to data collection, which medium to use, how to recruit a representative sample and how to formulate the questions in a non-directive way to avoid information bias, are described in the following textbooks: Survey Sampling (L. Kish, Wiley, 1995) and Survey Methodology (R.M. Groves, F.J. Fowler, M.P. Couper et al., 2nd Edition, Wiley 2009).


Although primarily focused on quality of life research, the book Quality of Life: the assessment, analysis and interpretation of patient-related outcomes (P.M. Fayers, D. Machin, 2nd Edition, Wiley, 2007) offers a comprehensive review of the theory and practice of developing, testing and analysing questionnaires in different settings. Health Measurement Scales: a practical guide to their development and use (D. L. Streiner, G. R. Norman, 4th Edition, Oxford University Press, 2008) is a very helpful guide to those involved in measuring subjective states and learning style in patients and healthcare providers.


Representativeness is an important element of surveys: the included sample should be representative of the target population, which must be defined with regard to the research question. For example, if the objective of the survey is to evaluate whether the RMM are distributed to the right target population, the lists used for the distribution of the RMM material cannot be used as the source population for sampling.


The response rate is also an important metric of a survey and should be reported for each survey based on a standard definition so that comparison among different surveys is possible. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys of the American Association for Public Opinion Research (AAPOR) provides standard definitions which can be adapted to the context of pharmacoepidemiological surveys. The overall response rate remains low in telephone surveys (J.M. Lepkowski, N.C. Tucker, J.M. Brick et al., Ed., Advances in Telephone Survey Methodology, Wiley 2007, Part V), and low response is important to counteract since it leads to lack of power and reduced representativeness. Measures to counteract it include the use of short or personalised questionnaires approved by professional associations.
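As a hedged illustration of one of the AAPOR standard definitions, the minimum response rate (RR1) divides completed interviews by all potentially eligible cases; the disposition counts below are invented for illustration, and unknown-eligibility categories are pooled for simplicity:

```python
# Hedged sketch of the AAPOR minimum response rate (RR1): completed
# interviews divided by all potentially eligible cases. The disposition
# counts below are invented for illustration.

def response_rate_1(complete, partial, refusal, non_contact,
                    other, unknown_eligibility):
    """AAPOR RR1 = I / (I + P + R + NC + O + UH + UO); unknowns pooled here."""
    denominator = (complete + partial + refusal + non_contact
                   + other + unknown_eligibility)
    return complete / denominator

rr1 = response_rate_1(complete=320, partial=40, refusal=150,
                      non_contact=90, other=10, unknown_eligibility=190)
print(f"RR1 = {rr1:.1%}")  # 320 / 800 = 40.0%
```

RR1 is the most conservative of the AAPOR rates; other variants (RR2 to RR6) treat partial interviews and cases of unknown eligibility differently.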


4.1.2. Randomised clinical trials


Randomised clinical trials are another form of primary data collection. There are numerous textbooks and publications on methodological and operational aspects of clinical trials and they are not covered here. An essential guideline on clinical trials is the European Medicines Agency (EMA) Note for Guidance on Good Clinical Practice, which specifies obligations for the conduct of clinical trials to ensure that the data generated in the trial are valid.


4.2. Secondary data collection


Secondary data collection refers to collection of data already gathered for another purpose (e.g. electronic healthcare data, claims or prescription data). These can also be linked to non-medical data. The last two decades have witnessed the development of key data resources, expertise and methodology that have allowed use of such data for pharmacoepidemiology. The ENCePP Inventory of Data Sources contains key information on the databases that are registered in ENCePP. Section 4.6 of this Guide also describes existing research networks.


A comprehensive description of the main features and applications of frequently used databases for pharmacoepidemiology research in the United States and in Europe appears in the book Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy, 5th Edition, Wiley, 2012, Chapters 11-18). The limitations of using electronic healthcare databases should be acknowledged, as detailed in A review of uses of healthcare utilisation databases for epidemiologic research on therapeutics (J Clin Epidemiol 2005;58:323-37).


The primary purpose of the ISPE-endorsed Guidelines for Good Database Selection and use in Pharmacoepidemiology Research (Pharmacoepidemiol Drug Saf 2012;21:1-10) is to assist in the selection and use of data resources in pharmacoepidemiology by highlighting potential limitations and recommending tested procedures. This text mainly refers to databases of routinely collected healthcare information and does not include spontaneous report databases. It is a simple, well-structured guideline that will help investigators when selecting databases for their research and helps database custodians to describe their database in a useful manner. An entire section is dedicated to the use of multi-site studies. The entire document contains references to data quality and data processing/transformation issues and there are sections dedicated to quality and validation procedures. There are also separate sections on privacy and security.


The Working Group for the Survey and Utilisation of Secondary Data (AGENS), with representatives from the German Society for Social Medicine and Prevention (DGSPM) and the German Society for Epidemiology (DGEpi), developed Good Practice in Secondary Data Analysis Version 2, aiming to establish a standard for planning, conducting and analysing studies on the basis of secondary data. It is also intended to serve as the basis for contracts between data owners (so-called primary users) and secondary users. It is divided into 11 sections addressing, among other aspects, the study protocol, quality assurance and data protection.

The FDA’s Best Practices for Conducting and Reporting Pharmacoepidemiologic Safety Studies Using Electronic Health Care Data Sets provides criteria for best practice that apply to design, analysis, conduct and documentation. It emphasizes that investigators should understand potential limitations of electronic healthcare data systems, make provisions for their appropriate use and refer to validation studies of safety outcomes of interest in the proposed study and captured in the database.


General guidance for studies including those conducted with electronic healthcare databases can also be found in the ISPE GPP, in particular sections IV-B (Study conduct, Data collection). This guidance emphasises the paramount importance of patient data protection.


The International Society for Pharmacoeconomics and Outcome Research (ISPOR) established a task force to recommend good research practices for designing and analysing retrospective databases for comparative effectiveness research (CER). The Task Force has subsequently published three articles (Part I, Part II and Part III) that review methodological issues and possible solutions for CER studies based on secondary data analysis (see also section 10.1 on comparative effectiveness research). Many of the principles are applicable to studies with other objectives than CER, but aspects of pharmacoepidemiological studies based on secondary use of data, such as data quality, ethical issues, data ownership and privacy, are not covered.

Particular issues to be considered in the use of electronic healthcare data for pharmacoepidemiological research include the following:

  • Completeness of data capture: does the database reliably capture all of the patient’s healthcare interactions or are there known gaps in coverage, capture, longitudinality or eligibility? Researchers using claims data rarely have the opportunity to carry out quality assurance on the whole data set. Descriptive analyses of the integrity of a US Medicaid Claims Database (Pharmacoepidemiol Drug Saf 2003;12:103–11) concludes that performing such analyses can reveal important limitations of the data and whenever possible, researchers should examine the ‘parent’ data set for apparent irregularities.
  • The relevance of bias in assessment of drug exposure for quality control in clinical databases: European Surveillance of Antimicrobial Consumption (ESAC): Data Collection Performance and Methodological Approach (Br J Clin Pharmacol 2004;58: 419-28) describes a retrospective data collection effort (1997–2001) through an international network of surveillance systems, aimed at collecting publicly available, comparable and reliable data on antibiotic use in Europe. The data collected were screened for bias, using a checklist focusing on detection bias in sample and census data, errors in assigning medicinal product packages to the Anatomical Therapeutic Chemical Classification System, errors in calculations of Defined Daily Doses per package, bias by over-the-counter sales and parallel trade, and bias in ambulatory/hospital care mix. The authors describe the methodological rigour needed to assure data validity and to ensure reliable cross-national comparison.


  • Validity of diagnoses: Validation and validity of diagnoses in the General Practice Research Database (GPRD): a systematic review (Br J Clin Pharmacol 2010;69:4-14) investigated the range of methods used to validate diagnoses in a primary care database and concluded that a number of methods had been used to assess validity and that overall, estimates of validity were high. The quality of reporting of the validations was, however, often inadequate to permit a clear interpretation. Not all methods provided a quantitative estimate of validity and most methods considered only the positive predictive value of a set of diagnostic codes in a highly selected group of cases.



  • The impact of changes over time in data, access methodology and the environment: Evidence generation from healthcare databases: recommendations for managing change (Pharmacoepidemiol Drug Saf 2016;25(7):749-754) proposes aspects to be considered to minimise the occurrence of problems of validity, reproducibility and comparability because of changes in the data or systems. A section addresses issues that may occur where common data models and associated tools are introduced.
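For the diagnostic-validity point above, validation exercises typically yield a positive predictive value (PPV) for a set of diagnostic codes. The sketch below uses invented counts, not figures from the cited GPRD review:

```python
# Sketch of computing the positive predictive value (PPV) of a set of
# diagnostic codes against a gold standard such as case-note review.
# The counts are invented, not figures from the cited GPRD review.

def positive_predictive_value(true_positives, false_positives):
    """PPV = confirmed cases / all cases identified by the code set."""
    return true_positives / (true_positives + false_positives)

# e.g. 184 of 200 code-identified cases confirmed on record review
ppv = positive_predictive_value(true_positives=184, false_positives=16)
print(f"PPV = {ppv:.1%}")  # 92.0%
```

As the cited review notes, a high PPV in a selected group of cases says nothing about sensitivity, i.e. the cases the code set misses.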


An example of the hazards of using large linked databases is provided in Vaccine safety surveillance using large linked databases: opportunities, hazards and proposed guidelines (Expert Rev Vaccines 2003; 2(1):21-9).

Quality management is further addressed in section 7 of the Guide.


4.3. Patient registries


4.3.1. Definition


A registry is an organised system that uses observational methods to collect uniform data on specified outcomes in a population defined by a particular disease, condition or exposure. A register is the database deriving from the registry (such as the EU PAS Register), and the difference between the two terms should be clearly understood even if they are often used interchangeably. The terms ‘register’ or ‘registry’ are sometimes incorrectly used to designate an exhaustive list of all patients who meet the eligibility criteria of a study, regardless of their inclusion in the study. The term ‘patient log-list’ could be used for this purpose.


A patient registry should be considered as a structure for the standardised recording of data from routine clinical practice on individual patients identified by the diagnosis of a disease, the occurrence of a condition (e.g., pregnancy), the prescription of a medicinal product (e.g., monoclonal antibodies), a hospital encounter, or any combination of these.


In the Nordic countries, where there is comprehensive registration of hospital data for a high proportion of the population, government-administered patient registries are administrative systems based on hospital encounters, including visit information, diagnoses and procedures; examples are the Norwegian Patient Registry, the Danish National Patient Registry and the Swedish National Patient Register. They may, however, lack information on lifestyle factors, patient-reported outcomes and laboratory data. A review of 103 Swedish Healthcare Quality Registries (J Intern Med 2015;277(1):94-136) describes additional healthcare quality registries focusing on specific disorders, initiated in Sweden mostly by physicians, with data on aspects of disease management, self-reported quality of life, lifestyle, and general health status, providing an important source for research.


4.3.2. Conceptual differences between a registry and a study


As illustrated in Registries in European post-marketing surveillance: a retrospective analysis of centrally approved products, 2005–2013 (Pharmacoepidemiol Drug Saf 2017 Mar 26), the conceptual differences between registries and studies need to be clearly understood.  


Patient registries are often integrated into routine clinical practice with systematic and sometimes automated data capture in electronic healthcare records, but disease, exposure or outcome-specific registries usually require recording of specific relevant data. Whilst the duration of a registry is normally open-ended, that of a study is dictated by the time needed to define and collect data relevant for the specific study objectives. Studies also often require introduction of specific procedures, questionnaires or data collection tools. Studies are set up and managed based on limited endpoints and a specific protocol, whereas patient registries are traditionally set up focusing on system(s) specifications in order to ensure a continuous, efficient and collaborative data collection; safe data hosting; accessible, retrievable, interoperable and re-usable data.


A registry can be used as a source of patients for studies based on either primary data collection (where the data collected for new patients are also used for a specific study) or secondary data collection (analogously to the use of electronic healthcare records). For this purpose, registry data can be enriched with additional information on outcomes, lifestyle, immunisation and mortality obtained from linkage to existing databases such as national cancer registries, prescription databases or mortality records.


4.3.3. Methodological guidance


The US Agency for Health Care Research and Quality (AHRQ) published a comprehensive document on ‘good registry practices’ entitled Registries for Evaluating Patient Outcomes: A User's Guide, 3rd Edition, which provides methodological guidance on planning, design, implementation, analysis, interpretation and evaluation of the quality of a registry. There is a dedicated section for linkage of registries to other data sources. The EU PARENT Joint Action developed methodological and governance guidelines to facilitate cross-border use of registries.  


Results obtained from analyses of registry data may be affected by the same biases as those of studies described in section 5.2 Bias and confounding. Registries are particularly sensitive to selection bias: the factors that may influence the enlistment of patients in a registry can be numerous (including clinical, demographic and socio-economic factors) and difficult to predict and identify, potentially resulting in a biased sample of the patient population if recruitment has not been exhaustive. In addition, studies that use registry data may introduce further selection bias in the recruitment or selection of registered patients for the specific study, as well as through differential completeness of follow-up and data collection. It is therefore important to systematically compare the characteristics of the study population with those of the source population.


The randomised registry trial is a new study design that combines the robustness of randomised studies with the higher generalisability of registry data, see section 5.6.3.


4.3.4. Registries which capture special populations


In assessing both safety and effectiveness, special populations can be identified based on age (e.g., paediatric or elderly), pregnancy status, renal or hepatic function, race, or genetic differences. Some registries are focused on these particular populations. Examples of these are the birth registries in Nordic countries. 


The FDA’s Guidance for Industry-Establishing Pregnancy Exposure Registries advises on good practice for designing a pregnancy registry, with a description of research methods and elements to be addressed. The Systematic overview of data sources for drug safety in pregnancy research provides an inventory of pregnancy exposure registries and alternative data sources on safety of prenatal drug exposure and discusses their strengths and limitations. Examples of population-based registers that allow assessment of the outcome of drug exposure during pregnancy are EUROCAT, the European network of registries for the epidemiologic surveillance of congenital anomalies, and the pan-Nordic registries, which record drug use during pregnancy, as illustrated in Selective serotonin reuptake inhibitors and venlafaxine in early pregnancy and risk of birth defects: population based cohort study and sibling design (BMJ 2015;350:h1798).


For paediatric populations, detailed information on neonatal age (e.g. in days, not just in years), pharmacokinetic differences and organ maturation need to be considered. The CHMP Guideline on Conduct of Pharmacovigilance for Medicines Used by the Paediatric Population provides further relevant information. An example of a registry which focuses on paediatric patients is Pharmachild, which captures children with juvenile idiopathic arthritis undergoing treatment with methotrexate or biologic agents.


Other registries that focus on special populations (e.g., the UK Renal Registry) can be found in the ENCePP Inventory of data sources.


4.3.5. Disease registries in regulatory practice and health technology assessment


Annex 1 of Module VIII of the Good pharmacovigilance practice provides guidance on the use of patient registries for regulatory purposes. It emphasises that the choice of the registry population and the design of the registry should be driven by its objective(s) in terms of the outcomes to be measured and the analyses and comparisons to be performed. As existing disease registries gather insights into the natural history and clinical aspects of diseases and allow comparison of outcomes between different treatments prescribed for the same indication, they are generally preferred to product registries for regulatory purposes. Module VIII also acknowledges that, due to their observational nature, registries should not normally be used to demonstrate effectiveness in the real-world setting, although in some cases (such as rare diseases, rare exposures or special populations) registries may be the only opportunity to provide insight into effectiveness aspects of a medicinal product. On the other hand, even when efficacy has been demonstrated in randomised controlled trials (RCTs), registries may be useful to study effectiveness in heterogeneous populations and effect modifiers, such as doses prescribed by physicians that may differ from those used in RCTs, patient sub-groups defined by variables such as age, co-morbidities, use of concomitant medication or genetic factors, or factors related to a defined country or healthcare system that might influence effectiveness.


To support better use of existing registries and facilitate the establishment of high-quality new registries, the EU regulatory network developed the Patient registries initiative. As part of this initiative, the ENCePP Resource database of data sources was used to support an inventory of existing disease registries.


Incorporating data from clinical practice into the drug development process is of growing interest to health technology assessment (HTA) bodies and payers, since reimbursement decisions can benefit from better estimation and prediction of the effectiveness of treatments at the time of product launch. One example of where registries can provide clinical practice data is in supporting the building of predictive models that incorporate data from both RCTs and registries to bridge the efficacy-effectiveness gap, i.e. to generalise results observed in RCTs to a real-world setting. Collecting relevant HTA data in early development and planning post-authorisation data collection may therefore support rapid relative effectiveness assessment and decision-making on drug pricing and reimbursement. In this context, the EUnetHTA Joint Action 3 project has issued guidelines for the definition of the research questions and the choice of data sources and methodology that will support the generation of post-launch evidence.


4.4. Spontaneous report databases


Spontaneous reports of adverse drug effects remain a cornerstone of pharmacovigilance and are collected from a variety of sources, including healthcare providers, national authorities, pharmaceutical companies, medical literature and more recently directly from patients. EudraVigilance is the European Union data processing network and management system for reporting and evaluation of suspected adverse drug reactions (ADRs). The Global Individual Case Safety Reports Database System (VigiBase) pools reports of suspected ADRs from the members of the WHO programme for international drug monitoring. These systems deal with the electronic exchange of Individual Case Safety Reports (ICSRs), the early detection of possible safety signals and the continuous monitoring and evaluation of potential safety issues in relation to reported ADRs. The report Characterization of databases (DB) used for signal detection (SD) of the PROTECT project shows the heterogeneity of spontaneous databases and the lack of comparability of SD methods employed. This heterogeneity is an important consideration when assessing the performance of SD algorithms.


The strength of spontaneous reporting systems is that they cover all types of legal drugs used in any setting. In addition to this, the reporting systems are built to obtain information specifically on potential adverse drug reactions and the data collection concentrates on variables relevant to this objective and directs reporters towards careful coding and communication of all aspects of an ADR. The increase in systematic collection of ICSRs in large electronic databases has allowed the application of data mining and statistical techniques for the detection of safety signals. There are known limitations of spontaneous ADR reporting systems, which include limitations embedded in the concept of voluntary reporting, whereby known or unknown external factors may influence the reporting rate and data quality. ICSRs may be limited in their utility by a lack of data for an accurate quantification of the frequency of events or the identification of possible risk factors for their occurrence. For these reasons, the concept is now well accepted that any signal from spontaneous reports needs to be verified clinically before further communication.
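As a minimal sketch of the data mining and statistical techniques mentioned above, one widely used disproportionality statistic is the proportional reporting ratio (PRR), computed from a 2x2 table of report counts. The counts and the interpretation threshold below are hypothetical; real systems such as EudraVigilance use more elaborate methods, confidence intervals and signalling criteria:

```python
# Minimal sketch of one common disproportionality statistic, the
# proportional reporting ratio (PRR), computed from a hypothetical 2x2
# table of ICSR counts. Real systems use more elaborate methods,
# confidence intervals and signalling criteria.

def prr(a, b, c, d):
    """
    a: reports of the event of interest for the drug of interest
    b: reports of all other events for the drug of interest
    c: reports of the event of interest for all other drugs
    d: reports of all other events for all other drugs
    """
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: the event appears in 30 of 1000 reports for the
# drug, versus 500 of 100000 reports for all other drugs.
value = prr(a=30, b=970, c=500, d=99500)
print(f"PRR = {value:.1f}")  # (30/1000) / (500/100000) = 6.0
```

A PRR well above 1 indicates that the event is reported disproportionately often for the drug; as the text stresses, such a statistical finding is only a signal requiring clinical verification, not evidence of causation.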


One challenge in spontaneous report databases is report duplication. Duplicates are separate and unlinked records that refer to one and the same case of a suspected ADR and may mislead clinical assessment or distort statistical screening. They are generally detected by individual case review of all reports or by computerised duplicate detection algorithms. In Performance of probabilistic method to detect duplicate individual case safety reports (Drug Saf 2014;37(4):249-58) a probabilistic method highlighted duplicates that had been missed by a rule-based method and also improved the accuracy of manual review. In the study, however, a demonstration of the performance of de-duplication methods to improve signal detection is lacking.
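A probabilistic approach of the kind discussed above can be sketched in the spirit of record linkage: fields that agree add evidence that two reports describe the same case, fields that disagree subtract it. Field names, weights and the decision threshold below are illustrative assumptions, not the method of the cited study:

```python
# Illustrative sketch of probabilistic duplicate scoring for ICSRs, in the
# spirit of record linkage (Fellegi-Sunter). Field names, weights and the
# decision threshold are assumptions for illustration only.
import math

# (agreement weight, disagreement weight) per field, as log-likelihood ratios
WEIGHTS = {
    "age":   (math.log(10), -math.log(2)),
    "sex":   (math.log(2),  -math.log(5)),
    "drug":  (math.log(20), -math.log(10)),
    "event": (math.log(15), -math.log(10)),
    "onset": (math.log(25), -math.log(3)),
}

def match_score(report_a, report_b):
    """Sum agreement/disagreement evidence over fields present in both reports."""
    score = 0.0
    for field, (agree, disagree) in WEIGHTS.items():
        if report_a.get(field) is None or report_b.get(field) is None:
            continue  # missing values contribute no evidence either way
        score += agree if report_a[field] == report_b[field] else disagree
    return score

r1 = {"age": 54, "sex": "F", "drug": "drugX", "event": "rash", "onset": "2016-03-01"}
r2 = {"age": 54, "sex": "F", "drug": "drugX", "event": "rash", "onset": "2016-03-01"}
print(match_score(r1, r2) > 5.0)  # scores above an assumed threshold flag likely duplicates
```

Handling missing fields as neutral evidence, as here, is one design choice; the cited study discusses how probabilistic scoring can catch duplicates that rule-based exact matching misses.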


Validation of statistical signal detection procedures in EudraVigilance post-authorisation data: a retrospective evaluation of the potential for earlier signalling (Drug Saf 2010;33:475-87) has shown that the statistical methods applied in EudraVigilance can provide significantly earlier warning in a large proportion of drug safety problems. Nonetheless, this approach should supplement, rather than replace, other pharmacovigilance methods.


Chapters IV and V of the Report of the CIOMS Working Group VIII ‘Practical aspects of Signal detection in Pharmacovigilance’ present sources and limitations of spontaneously-reported drug-safety information and databases that support signal detection. Appendix 3 of the report provides a list of international and national spontaneous reporting system databases.


4.5. Social media and electronic devices


Technological advances have dramatically increased the range of data sources that can be used to complement traditional ones and provide compelling insights into effectiveness and safety of interventions. Such data include digital social media that exist in a computer-readable format on websites, web pages, blogs, vlogs, social networking sites, internet forums, chat rooms, health portals, etc. A recent addition to this list is represented by the data collected through mobile and other device applications such as wearable technology.


There is growing interest in using these data sources to obtain patient-generated information relevant for medicines safety surveillance.


Social media is a source of potential reports of suspected ADRs. Marketing authorisation holders (MAHs) are legally obliged to screen web sites under their management and assess whether they qualify for reporting. Spontaneous ADRs identified from social media should be handled as unsolicited reports and evaluated and reported accordingly.


Social media is already being used to provide insights into the patient’s perception of the effectiveness of drugs and for the collection of patient reported outcomes (PROs) as discussed in Web-based patient-reported outcomes in drug safety and risk management: challenges and opportunities? (Drug Saf. 2012;35(6):437-46).


Another use of social media might be the identification of new safety issues (signal detection). This would add value only if more issues were identified, or identified faster, than with existing methods, but there is currently no evidence that this is the case. Using Social Media Data in Routine Pharmacovigilance: A Pilot Study to Identify Safety Signals and Patient Perspectives (Pharm Med 2017;31:167-74) explores whether analysis of social media data could identify new signals, known signals from routine pharmacovigilance, known signals sooner, and specific issues (i.e., quality issues and patient perspectives). The study also sought to determine the quantity of ‘posts with resemblance to AEs’ (proto-AEs) and the types and characteristics of products that would benefit from social media analysis. It concludes that social media data analysis could not identify new safety signals but can provide unique insight into the patient perspective. Assessment was limited by numerous factors, such as data acquisition, language, and demographics. Further research is deemed necessary to determine the best uses of social media data to augment traditional pharmacovigilance surveillance.


One ongoing EU project, WEB-RADR, is investigating the potential of publicly available social media data for identifying drug safety issues. The results of the WEB-RADR project will inform regulatory policy on the use of social media for pharmacovigilance; initial results show there may be utility in specific niche areas such as misuse/abuse or off-label use.


While offering the promise of new research models and approaches, the rapidly evolving social media environment presents many challenges including the need for strong and systematic processes for selection, validation and study implementation. Articles which detail associated challenges are: Evaluating Social Media Networks in Medicines Safety Surveillance: Two Case Studies (Drug Saf. 2015; 38(10): 921–930.) and Social media and pharmacovigilance: A review of the opportunities and challenges (Br J Clin Pharmacol. 2015 Oct; 80(4): 910–920).


There is currently no defined strategy or framework in place to meet standards of data validity and generalisability for this type of data. Therefore, regulatory acceptance of this type of data might be lower than for traditional sources.


More tools and solutions for analysing unstructured data, especially for pharmacoepidemiology and drug safety research, are becoming available, as in Deep learning for pharmacovigilance: recurrent neural network architectures for labelling adverse drug reactions in Twitter posts (J Am Med Inform Assoc. 2017 Feb 22) and Social Media Listening for Routine Post-Marketing Safety Surveillance (Drug Saf. 2016;39(5):443-54).


Before an informed strategy is put in place, the following factors may be considered when using social media and data from electronic devices:

  • Completeness of data capture.
  • Validation processes defined for the devices, including their accuracy.
  • Reliability and reproducibility of outputs/inputs from the device.
  • Data warehousing requirements for secure storage of the volume of data potentially received from wearable devices.

Data from social media and electronic devices can be both structured and unstructured. When analysing unstructured data, the following factors may be considered:

  • Tools used for crawling the web and the methods used for handling unstructured data should be well defined along with their potential limitations e.g. the type of natural language processing (NLP) approach and software used.
  • How exposure and outcomes were defined within unstructured data and whether these have been derived and validated.
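As a concrete illustration of the last two points, the sketch below flags ‘posts with resemblance to AEs’ (proto-AEs) using a simple keyword screen. This is a minimal hypothetical example, not a recommended method: real systems use trained natural language processing models, and the drug and event vocabularies here are assumptions made for the sketch.

```python
import re

# Assumed vocabularies for the sketch; production systems would use
# curated dictionaries (e.g. drug name lists, MedDRA-mapped event terms).
DRUG_TERMS = {"drugx"}
AE_TERMS = {"headache", "nausea", "rash"}

def tokenize(post: str) -> set:
    """Lower-case the post and split it into alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", post.lower()))

def is_proto_ae(post: str) -> bool:
    """Flag a post that mentions both a drug term and an event term."""
    tokens = tokenize(post)
    return bool(tokens & DRUG_TERMS) and bool(tokens & AE_TERMS)

posts = [
    "Started DrugX last week, now constant headache",
    "DrugX works great for me!",
    "Terrible nausea today",
]
flags = [is_proto_ae(p) for p in posts]
# Only the first post mentions both an exposure and an outcome term.
```

Even this toy example shows why the definitions of exposure and outcome in unstructured text must be stated and validated: the second and third posts contain a drug mention or an event mention alone, and a naive co-occurrence rule says nothing about causality.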

4.6. Research networks


4.6.1. General considerations


Pooling data across different databases to gain power and increase the generalisability of results is becoming increasingly necessary. In Europe, collaborations for multi-database studies have been strongly encouraged in recent years through drug safety research funded by the European Commission (EC) and public-private partnerships such as the Innovative Medicines Initiative (IMI). This funding supported the groundwork necessary to overcome the hurdles of data sharing across countries. A growing number of studies use data from networks of databases, often from different countries.


In the US, the HMO Research Network (HMORN), Observational Health Data Sciences and Informatics (OHDSI) and the Sentinel initiative are examples of consortia involving health maintenance organisations that have formal, recognised research capabilities. Networking implies collaboration between investigators in sharing expertise and resources. The ENCePP Database of Research Resources may facilitate such networking by providing an inventory of research centres and data sources that can collaborate on specific pharmacoepidemiology and pharmacovigilance studies in Europe. It allows the identification of centres and data sets by country, type of research and other relevant fields.


From a methodological point of view, research networks have many advantages:

  • The potential for pooling data or results maximises the amount of information gathered for a specific issue addressed in different databases.
  • Research networks increase the size of study populations and shorten the time needed for obtaining the desired sample size. Hence, they can facilitate research on rare events and speed-up investigation of drug safety issues.
  • The heterogeneity of treatment options across countries allows studying the effect of individual drugs.
  • Research networks may provide additional knowledge on whether a drug safety issue exists in several countries and thereby reveal causes of differential drug effects, on the generalisability of results, on the consistency of information and on the impact of biases on estimates.
  • Involvement of experts from various countries addressing case definitions, terminologies, coding in databases and research practices provides opportunities to increase consistency of results of observational studies.
  • Sharing of data sources facilitates harmonisation of data elaboration and transparency in analyses and benchmarking of data management.

Different models have been applied for combining data or results from multiple databases. A common characteristic of all models is that data partners maintain physical and operational control over electronic data in their existing environment. Differences exist, however, in whether a common protocol or a common data model is applied across all databases to extract, analyse and combine the data. A common data model (CDM) approach provides a similar representation of each database that allows standardisation of administrative and clinical information and facilitates a combined analysis across several databases. The CDM can be applied systematically to all data of a database (generalised CDM) or to the subset of data needed for a specific study (study-specific CDM).
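The core idea of a study-specific CDM can be sketched as follows: each data partner writes a mapping from its local layout into one shared record format, after which a single analysis program runs unchanged on every partner's extract. The field names and source layouts below are invented for illustration and do not correspond to any particular network's specification.

```python
from dataclasses import dataclass

@dataclass
class CdmRecord:
    """Shared study-specific record format agreed across data partners."""
    patient_id: str
    exposure_code: str  # e.g. an ATC code
    event_date: str     # ISO date

def map_claims_row(row: dict) -> CdmRecord:
    """Mapping for a hypothetical claims-database layout."""
    return CdmRecord(row["member"], row["atc"], row["svc_date"])

def map_ehr_row(row: dict) -> CdmRecord:
    """Mapping for a hypothetical electronic health record extract."""
    return CdmRecord(row["pid"], row["atc_code"], row["date"])

claims_row = {"member": "A1", "atc": "J01CA04", "svc_date": "2020-03-01"}
ehr_row = {"pid": "B2", "atc_code": "J01CA04", "date": "2020-03-05"}

# After local mapping, both partners' data share one structure and one
# coding system, so the central analysis code needs no per-site logic.
records = [map_claims_row(claims_row), map_ehr_row(ehr_row)]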


4.6.2. Models of studies using multiple data sources


i) Local data extraction and analysis, separate protocols

The traditional way to combine data from multiple data sources is for data extraction and analysis to be performed independently at each centre on the basis of separate protocols. This is usually followed by meta-analysis of the different estimates obtained (see Chapter 5.7).
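The pooling step can be illustrated with a minimal fixed-effect (inverse-variance) meta-analysis of per-database estimates, one of the standard approaches covered in Chapter 5.7. The relative risks and standard errors below are made up for the sketch; a real analysis would also assess heterogeneity and consider a random-effects model.

```python
import math

def pool_fixed_effect(log_rrs, ses):
    """Inverse-variance weighted pooled log-RR and its standard error."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * lr for w, lr in zip(weights, log_rrs)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Hypothetical estimates from three databases: RR and SE of log(RR).
log_rrs = [math.log(1.4), math.log(1.1), math.log(1.6)]
ses = [0.20, 0.15, 0.30]

pooled_log_rr, pooled_se = pool_fixed_effect(log_rrs, ses)
pooled_rr = math.exp(pooled_log_rr)
# The pooled estimate lies between the individual RRs and is more
# precise (smaller SE) than any single database's estimate.
```

The precision gain in the last comment is the statistical motivation for networking databases: each additional database adds inverse-variance weight, which is what shortens the time needed to reach a target sample size for rare events.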


ii) Local data extraction and analysis, common protocol

In this option, data are extracted and analysed locally on the basis of a common protocol. Definitions of exposure, outcomes and covariates, analytical programmes and reporting formats are standardised according to the common protocol, and the results of each analysis are shared in an aggregated format and pooled through meta-analysis. This approach allows assessment of database/population characteristics and their impact on estimates but reduces variability of results determined by differences in design. Examples of research networks that use the common protocol approach are the PROTECT project (as described in Improving Consistency and Understanding of Discrepancies of Findings from Pharmacoepidemiological Studies: the IMI PROTECT Project, Pharmacoepidemiol Drug Saf 2016;25(S1):1–165) and the Canadian Network for Observational Drug Effect Studies (CNODES).


This approach requires very detailed common protocols and data specifications that reduce variability in interpretations by researchers.


Multi-centre, multi-database studies with common protocols: lessons learnt from the IMI PROTECT project (Pharmacoepidemiol Drug Saf 2016;25(S1):156-165) states that a priori pooling of data from several databases may disguise heterogeneity that could provide useful information on the safety issue under investigation. On the other hand, parallel analysis of databases allows exploring reasons for heterogeneity through extensive sensitivity analyses. This approach eventually increases consistency in findings from observational drug effect studies or reveals causes of differential drug effects.


iii) Local data extraction and central analysis, common protocol


For some studies, it has been possible to centrally analyse patient-level data extracted on the basis of a common protocol, such as in Selective serotonin reuptake inhibitors during pregnancy and risk of persistent pulmonary hypertension in the newborn: population based cohort study from the five Nordic Countries (BMJ 2012;344:d8012). If databases are very similar in structure and content, as is the case for some Nordic registries, a CDM might not be required for data extraction. The central analysis removes an additional source of variability linked to the statistical programming and analysis.


iv) Local data extraction and central analysis, study-specific common data model


Data can also be extracted from local databases using a study-specific, database-tailored extraction into a CDM and pre-processed locally. The resulting data can be transmitted to a central data warehouse as patient-level data or aggregated data for further analysis. Examples of research networks that used this approach by employing a study-specific CDM with transmission of anonymised patient-level data (allowing a detailed characterisation of each database) are EU-ADR (as explained in Combining electronic healthcare databases in Europe to allow for large-scale drug safety monitoring: the EU-ADR Project, Pharmacoepidemiol Drug Saf 2011;20(1):1-11), SOS, ARITMO, SAFEGUARD, GRIP and ADVANCE.


An approach to expedite the analysis of heterogeneity, called the component strategy, was initially developed in the EMIF project and could also be compatible with the generalised common data model (see Identifying Cases of Type 2 Diabetes in Heterogeneous Data Sources: Strategy from the EMIF Project. PLoS ONE. 2016;11(8):e0160648).


v) Local data extraction and central analysis, generalised common data model


Two examples of research networks which use a generalised CDM are the Sentinel Initiative (as described in The U.S. Food and Drug Administration's Mini-Sentinel Program, Pharmacoepidemiol Drug Saf 2012;21(S1):1–303) and OHDSI. The main advantage of a generalised CDM is that it can be used for virtually any study involving the database. OHDSI is based on the Observational Medical Outcomes Partnership (OMOP) CDM, which is used by many organisations and has been tested for its suitability for safety studies (see for example Validation of a common data model for active safety surveillance research, J Am Med Inform Assoc. 2012;19(1):54–60). OMOP also developed an open source repository for the analytical tools created within the project.


In A Comparative Assessment of Observational Medical Outcomes Partnership and Mini-Sentinel Common Data Models and Analytics: Implications for Active Drug Safety Surveillance (Drug Saf. 2015;38(8):749-65), it is suggested that slight conceptual differences between the Sentinel and the OMOP models do not significantly impact the identification of known safety associations. Differences in risk estimations can be primarily attributed to the choice and implementation of the analytic approach.

4.6.3. Challenges of different models


The models presented above face many challenges:


Related to the scientific content

Related to the organisation of the network

  • Differences in culture and experience between academia, public institutions and private partners.

  • Differences in the type and quality of information contained within each mapped database.

  • Different ethical and governance requirements in each country regarding processing of anonymised or pseudo-anonymised healthcare data.

  • Choice of data sharing model and access rights of partners.

  • Issues linked to intellectual property and authorship.

  • Sustainability and funding mechanisms.

Each model has strengths and weaknesses in facing the above challenges (Data Extraction and Management in Networks of Observational Health Care Databases for Scientific Research: A Comparison of EU-ADR, OMOP, Mini-Sentinel and MATRICE Strategies (EGEMS. 2016 Feb)). Experience has shown that many of these difficulties can be overcome by full involvement and good communication between partners, and a project agreement between network members defining roles and responsibilities and addressing issues of intellectual property and authorship.


