ENCePP Guide on Methodological Standards in Pharmacoepidemiology


Chapter 4: Approaches to data collection


4. Approaches to data collection

4.1. Primary data collection

      4.1.1. Surveys

      4.1.2. Randomised clinical trials

4.2. Secondary use of data

4.3. Patient registries

      4.3.1. Definitions

      4.3.2. Conceptual differences between a registry and a study

      4.3.3. Methodological aspects

      4.3.4. Population registries

      4.3.5. Registries which capture special populations

      4.3.6. Disease registries in regulatory practice and health technology assessment

4.4. Spontaneous reports

4.5. Social media

      4.5.1. Definition

      4.5.2. Use in pharmacovigilance

      4.5.3. Challenges

      4.5.4. Data protection

4.6. Research networks for multi-database studies

      4.6.1. General considerations

      4.6.2. Models of studies using multiple data sources

      4.6.3. Challenges of different models



4. Approaches to data collection


There are two main approaches for data collection: collection of data specifically for a particular study (‘primary data collection’) or use of data already collected for another purpose, e.g. as part of administrative records of patient health care (‘secondary use of data’). The distinction between primary data collection and secondary use of data is important for marketing authorisation holders as it implies different regulatory requirements for the collection and reporting of suspected adverse reactions, as described in Module VI of the Guideline on good pharmacovigilance practice (GVP) - Management and reporting of adverse reactions to medicinal products.

Secondary use of data has become a common approach used in pharmacoepidemiology due to the increasing availability of electronic healthcare records, administrative claims data and other already existing data sources (see Chapter 4.2 Secondary use of data) and due to its increased efficiency and lower cost. In addition, networking between centres active in pharmacoepidemiology and pharmacovigilance is rapidly changing the landscape of drug safety research in Europe, both in terms of networks of data and networks of researchers who can contribute to a particular study with a particular data source (see Chapter 4.6 Research Networks).

4.1. Primary data collection


The methodological aspects of primary data collection studies are well covered in the textbooks and guidelines referred to in the Introduction chapter. Annex 1 of Module VIII of the Good pharmacovigilance practice provides examples of different study designs based on prospective primary data collection, such as cross-sectional studies, prospective cohort studies and active surveillance. Surveys and randomised controlled trials are also presented below as examples of primary data collection.

Studies using hospital or community-based primary data collection have allowed the evaluation of drug-disease associations for rare complex conditions that require very large source populations and in-depth case assessment by clinical experts. Classic examples are Appetite-Suppressant Drugs and the Risk of Primary Pulmonary Hypertension (N Engl J Med 1996;335:609-16), The design of a study of the drug etiology of agranulocytosis and aplastic anemia (Eur J Clin Pharmacol 1983;24:833-6) and Medication Use and the Risk of Stevens–Johnson Syndrome or Toxic Epidermal Necrolysis (N Engl J Med 1995;333:1600-8). For some conditions, case-control surveillance networks have been developed and used for selected studies and for signal generation and clarification, e.g. Signal generation and clarification: use of case-control data (Pharmacoepidemiol Drug Saf 2001;10:197-203).

4.1.1. Surveys


A survey is the collection of data on knowledge, attitudes, behaviour, practices, opinions, beliefs or feelings of selected groups of individuals drawn from a sampling frame, by asking them in person, on paper, by phone or online. Surveys generally have a cross-sectional design, but repeated measures over time may be used to assess trends.


Surveys have been used for a long time in fields such as marketing, social science and epidemiology. General guidance on constructing and testing the survey questionnaire, modes of data collection, sampling frames and ways to achieve representativeness can be found in general texts such as Survey Sampling (L. Kish, Wiley, 1995) and Survey Methodology (R.M. Groves, F.J. Fowler, M.P. Couper et al., 2nd Edition, Wiley, 2009). The book Quality of Life: the assessment, analysis and interpretation of patient-related outcomes (P.M. Fayers, D. Machin, 2nd Edition, Wiley, 2007) offers a comprehensive review of the theory and practice of developing, testing and analysing quality of life questionnaires in different settings.


Surveys have an important role in the evaluation of the effectiveness of risk minimisation measures (RMM) or of a risk evaluation and mitigation strategy (REMS) (see Chapter 5.9). The methods described in the aforementioned textbooks need adaptation for surveys that evaluate the effectiveness of RMM or REMS. For example, the extensive methods for questionnaire development of quality of life scales (construct, criterion and content validity, inter-rater and test-retest reliability, sensitivity and responsiveness) are not appropriate for risk minimisation (RM) questionnaires, which are often used only once. The EMA and FDA issued guidance documents on the conduct of surveys for RM which, together, encompass the selection of risk minimisation measures, study design, instrument development, data collection, processing and data analysis and presentation of results. These guidance documents include the EMA Guideline on good pharmacovigilance practices (GVP) Module XVI (2017), the FDA draft guidance for industry REMS Assessment: Planning and Reporting on REMS (2019) and the FDA Guidance on Survey Methodologies to Assess REMS Goals That Relate to Knowledge (2019).


A checklist to assess the quality of studies evaluating RM programs is provided in The RIMES Statement: A Checklist to Assess the Quality of Studies Evaluating Risk Minimization Programs for Medicinal Products (Drug Saf 2018;41(4): 389-401). The article Are Risk Minimization Measures for Approved Drugs in Europe Effective? A Systematic Review (Expert Opin Drug Saf 2019;18(5):443-54) highlights the need for improvement in the methods and presentation of results and for more hybrid designs that link survey data with health and safety outcomes as requested by regulators. This article also reports on low response rates found in many studies, allowing for the possibility of important bias. The response rate should therefore be reported in a standardised way in surveys to allow comparisons. Standard Definitions. Final Dispositions of Case Codes and Outcome Rates for Surveys (2016) of the American Association for Public Opinion Research provides standard definitions which can be adapted to RM surveys and the FDA Guidance on Survey Methodologies to Assess REMS Goals That Relate to Knowledge (2019) provides guidance for RM surveys.
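
As an illustration of standardised response-rate reporting, the following minimal Python sketch computes two of the outcome rates commonly attributed to the AAPOR Standard Definitions (RR1 and RR2). The class name, field names and counts are hypothetical, and the formulas are written from the usual statement of those rates; the AAPOR document should be consulted for the exact definitions and for the variants that estimate eligibility among cases of unknown status.

```python
from dataclasses import dataclass


@dataclass
class Dispositions:
    """Final case dispositions for a survey sample (hypothetical field names)."""
    complete: int             # I: completed questionnaires
    partial: int              # P: partially completed questionnaires
    refusal: int              # R: refusals and break-offs
    non_contact: int          # NC: eligible cases never reached
    other: int                # O: other eligible non-respondents
    unknown_eligibility: int  # UH + UO: cases whose eligibility is unknown


def _denominator(d: Dispositions) -> int:
    """All eligible cases plus all cases of unknown eligibility."""
    total = (d.complete + d.partial + d.refusal
             + d.non_contact + d.other + d.unknown_eligibility)
    if total <= 0:
        raise ValueError("at least one sampled case is required")
    return total


def response_rate_1(d: Dispositions) -> float:
    """RR1: completes only, over eligible and unknown-eligibility cases."""
    return d.complete / _denominator(d)


def response_rate_2(d: Dispositions) -> float:
    """RR2: completes plus partials, over the same denominator as RR1."""
    return (d.complete + d.partial) / _denominator(d)


if __name__ == "__main__":
    # Invented counts for a hypothetical RM survey of 1000 sampled prescribers.
    sample = Dispositions(complete=300, partial=50, refusal=200,
                          non_contact=350, other=50, unknown_eligibility=50)
    print(f"RR1 = {response_rate_1(sample):.1%}")  # RR1 = 30.0%
    print(f"RR2 = {response_rate_2(sample):.1%}")  # RR2 = 35.0%
```

Reporting the disposition counts alongside the rate lets readers recompute it under a different definition, which is what makes comparisons across surveys possible.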


The increasing use of online RMM requires survey methods to adapt, but they should not sacrifice representativeness by accessing only populations which visit these websites. Surveys should provide evidence that the results obtained using these sampling methods are not biased. Similarly, with the increasing use of healthcare professional and patient panels, survey methods should not sacrifice representativeness by accessing only self-selected participants in these panels, and should provide evidence that the results are not biased by using these convenient sampling frames.


4.1.2. Randomised clinical trials


The randomised clinical trial is an experimental design that involves primary data collection. There are numerous textbooks and publications on methodological and operational aspects of clinical trials and they are not covered here. An essential guideline on clinical trials is the European Medicines Agency (EMA) Guideline for good clinical practice E6(R2), which specifies obligations for the conduct of clinical trials to ensure that the data generated in the trial are valid. From a legal perspective, Volume 10 of the Rules Governing Medicinal Products in the European Union contains all guidance and legislation relevant for the conduct of clinical trials. A number of these documents are under revision.


The way clinical trials are conducted in the European Union (EU) will undergo a major change when the Clinical Trial Regulation (Regulation (EU) No 536/2014) fully comes into effect and replaces the existing Directive 2001/20/EC.

Hybrid data collection, as used in pragmatic trials, large simple trials and randomised database studies, is described in Chapter 5.6.

4.2. Secondary use of data


Secondary use of data refers to the utilisation of data already gathered for another purpose (e.g. electronic and non-electronic healthcare data). These can be further linked to prospectively collected data including medical and non-medical data. The last decades have witnessed the development of key data resources, expertise and methodology that have allowed use of such data for pharmacoepidemiology. The ENCePP Inventory of Data Sources contains information on existing European databases. However, this field is continuously evolving and it is recommended to look for recently published reviews and lists of databases.


A comprehensive description of the main features and applications of frequently used electronic healthcare databases for pharmacoepidemiology research in the United States and in Europe appears in the book Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy. 6th Edition, Wiley, 2019, Chapters 11 - 14). The limitations of using electronic healthcare databases should be acknowledged, as detailed in A review of uses of healthcare utilisation databases for epidemiologic research on therapeutics (J Clin Epidemiol 2005;58:323-37).


The primary purpose of the ISPE-endorsed Guidelines for Good Database Selection and use in Pharmacoepidemiology Research (Pharmacoepidemiol Drug Saf 2012;21:1-10) is to assist in the selection and use of data resources in pharmacoepidemiology by highlighting potential limitations and recommending correct procedures. This guideline refers to the secondary use of databases containing routinely collected healthcare information such as electronic medical records and claims databases and does not include spontaneous reporting databases. It is a simple, well-structured guideline that will help investigators to select the most suitable databases to address a specific research question and help database custodians to describe their database in a useful manner. An entire section is dedicated to the use of multi-database studies. The document also contains references to data quality and validation procedures, data processing/transformation, privacy and security.


The Guidelines and recommendations for ensuring Good Epidemiological Practice (GEP): a guideline developed by the German Society for Epidemiology (European Journal of Epidemiology 2019;34(3):301-17) provide detailed recommendations on all aspects of the design and conduct of epidemiological studies, and many of these recommendations address aspects to be considered when making secondary use of data. The FDA’s Best Practices for Conducting and Reporting Pharmacoepidemiologic Safety Studies Using Electronic Health Care Data Sets provides criteria for best practice that apply to design, analysis, conduct and documentation. It emphasizes that investigators should understand the potential limitations of electronic healthcare data systems, make provisions for their appropriate use and refer to validation studies of safety outcomes of interest in the proposed study and captured in the database. Guidance for conducting studies within electronic healthcare databases can also be found in the International Society for Pharmacoepidemiology Guidelines for Good Pharmacoepidemiology Practices (ISPE GPP), in particular sections IV-B (Study conduct, Data collection). This guidance emphasizes the importance of patient data protection.


The concepts of “Real-world data” (RWD) and “Real-world evidence” (RWE) are increasingly used in the regulatory setting to denote the secondary use of observational data and pharmacoepidemiological methods for regulatory decision-making. The article Real-World Data for Regulatory Decision Making: Challenges and Possible Solutions for Europe (Clin Pharmacol Ther. 2019;106(1):36-9) describes the operational, technical and methodological challenges for the acceptability of real-world data for regulatory purposes and presents possible solutions to address these challenges. The FDA’s Real-World Evidence website also provides definitions and links to a set of useful guidelines on the submission and use of real-world data, including electronic health care databases, to support decision-making. The Joint ISPE-ISPOR Special Task Force Report on Good Practices for Real‐World Data Studies of Treatment and/or Comparative Effectiveness recommends good research practices for designing and analysing retrospective databases for comparative effectiveness research (CER) and reviews methodological issues and possible solutions for CER studies based on secondary data analysis (see also Chapter 10.1 on comparative effectiveness research). Many of the principles are applicable to studies with other objectives than CER, but aspects of pharmacoepidemiological studies based on secondary use of data, such as data quality, ethical issues, data ownership and privacy, are not covered.


The majority of the examples and methods covered in Chapter 5 are based on studies and methodologic developments in secondary data collection, since this is the most frequent approach used in pharmacoepidemiology. Several potential issues need to be considered in the use of electronic healthcare data for pharmacoepidemiological studies as they may affect the validity of the results. They include completeness of data capture, bias in the assessment of exposure, outcome and covariates, variability between data sources and the impact of changes over time in data, access methodology and the healthcare system.

Chapter 4.6. deals with models of studies conducted across multiple data sources.

4.3. Patient registries


4.3.1. Definitions


A patient registry is an organised system that uses observational methods to collect uniform data on specified outcomes in a population defined by a particular disease, condition or exposure. A registry-based study is an investigation of a research question using a patient registry infrastructure for patient recruitment and data collection. The term ‘registry’ is sometimes used incorrectly to designate a cohort study with primary data collection or a list of all patients meeting the eligibility criteria for a study (the term ‘patient log’ or ‘patient log-list’ could be used for the latter purpose).


A patient registry should be considered as an infrastructure for the standardised recording of data from routine clinical practice on individual patients identified by a characteristic or an event, for example the diagnosis of a disease (disease registry), the occurrence of a condition (e.g. pregnancy registry), a birth defect (e.g. birth defect registry), a molecular or a genomic feature or any other patient characteristics, or an encounter with a particular healthcare service. The term product registry is sometimes used for a system where data are collected on patients exposed to a particular medicinal product, single substance or therapeutic class in order to evaluate their use or their effects, but such a system should rather be considered a clinical trial or a non-interventional study, as data are collected for a specific pre-planned analysis purpose in line with performing a trial/study.


4.3.2. Conceptual differences between a registry and a study


As illustrated in Imposed registries within the European postmarketing surveillance system (Pharmacoepidemiol Drug Saf 2018; 27(7):823-826), there are methodological differences between registries and registry-based studies.


Patient registries are often integrated into routine clinical practice with systematic and sometimes automated data capture in electronic healthcare records. Whilst the duration of a registry is normally open-ended, that of a registry-based study is dictated by the time needed to define and collect data relevant for the specific study objectives. Studies may also require introduction of specific procedures, questionnaires or data collection tools. Studies are set up and managed based on a limited number of endpoints and a specific protocol, whereas patient registries should focus on system specifications in order to ensure continuous, efficient and collaborative data collection, safe data hosting and availability of retrievable, interoperable and re-usable data.

A registry can be used as a source of patients for studies based on either primary data collection (where the events of interest for the study are collected directly from the patients, caregivers, healthcare professionals or other persons involved in the patient care) or secondary use of data already collected (where the study uses data collected for another purpose, analogously to the use of electronic healthcare records). For this purpose, registry data can be enriched with additional information on outcomes, lifestyle data, immunisation or mortality information obtained from linkage to the existing databases such as national cancer registries, prescription databases or mortality records.

4.3.3. Methodological aspects


To support better use of existing registries for the benefit-risk evaluation of medicines, the EU regulatory network developed the Patient registries initiative. As part of this initiative, the European Medicines Agency organised several workshops on disease-specific registries. The reports of these workshops describe regulators’ expectations on common data elements to be collected and best practices on topics such as governance, data quality control, data sharing or reporting of safety data. The ENCePP Resource database of data sources is also used to support an inventory of existing disease registries.


The EMA’s Scientific Advice Working Party issued two Qualification Opinions for two registry platforms, the ECFSPR and the EBMT, with an evaluation of their potential use as data sources for registry-based studies. Although they apply only to two registry platforms, these opinions provide a good indication of the key methodological components expected by regulators for using a disease registry for such studies.


The US Agency for Health Care Research and Quality (AHRQ) published a comprehensive document on ‘good registry practices’ entitled Registries for Evaluating Patient Outcomes: A User's Guide, 3rd Edition, which provides methodological guidance on planning, design, implementation, analysis, interpretation and evaluation of the quality of a registry. There is a dedicated section for linkage of registries to other data sources. The EU PARENT Joint Action developed Methodological guidelines and recommendations for efficient and rational governance of patient registries to facilitate cross-border use of registries. 


Results obtained from analyses of registry data may be affected by the same biases as those of studies described in Chapter 5.2 Bias and confounding. Registry-based studies are sensitive to selection bias. This is because the factors that may influence the enlistment of patients in a registry may be numerous (including clinical, demographic and socio-economic factors) and difficult to predict and identify, potentially resulting in a biased sample of the patient population if recruitment has not been exhaustive. In addition, registry-based studies may also introduce selection bias in the recruitment or selection of registered patients for the specific study, as well as in the differential completeness of follow-up and data collection. It is therefore important to systematically compare the characteristics of the study population with those of the source population.
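
One simple way to make such a comparison systematic is to compute a standardised difference for each characteristic. The Python sketch below does this for characteristics expressed as proportions. It is only an illustration: the guide does not prescribe this measure, and the function names and example values are hypothetical.

```python
import math


def standardised_difference(p_study: float, p_source: float) -> float:
    """Standardised difference between two proportions (0 means identical)."""
    for p in (p_study, p_source):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie between 0 and 1")
    # Average of the two binomial variances, used as a common scale.
    pooled_var = (p_study * (1 - p_study) + p_source * (1 - p_source)) / 2
    if pooled_var == 0:
        if p_study == p_source:
            return 0.0
        raise ValueError("undefined when both proportions are exactly 0 or 1")
    return (p_study - p_source) / math.sqrt(pooled_var)


def compare_characteristics(study: dict, source: dict) -> dict:
    """Standardised difference for every characteristic present in both inputs."""
    return {name: standardised_difference(study[name], source[name])
            for name in study if name in source}


if __name__ == "__main__":
    # Invented proportions for a registry-based study and its source population.
    study_population = {"female": 0.60, "aged_65_plus": 0.35, "diabetes": 0.12}
    source_population = {"female": 0.52, "aged_65_plus": 0.41, "diabetes": 0.12}
    for name, value in compare_characteristics(study_population,
                                               source_population).items():
        print(f"{name}: {value:+.3f}")
```

Because the measure is scaled by the variability of the characteristic rather than by sample size, it describes how different the two populations are, not whether the difference is statistically significant.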

As illustrated in The randomized registry trial--the next disruptive technology in clinical research? (N Engl J Med 2013; 369: 1579-81) and Registry-based randomized controlled trials: what are the advantages, challenges and areas for future research? (J Clin Epidemiol 2016;80:16-24), the randomised registry-based trial may support enhanced generalisability of findings, rapid consecutive enrollment, and the potential completeness of follow-up for the reference population, when compared with conventional randomized effectiveness trials, but several challenges need to be considered (see also Chapter 5.6.3).

4.3.4. Population registries


In European Nordic countries, a comprehensive registration of data for a high proportion or all of the population allows linkage between government-administered patient registries that may include hospital encounters, diagnoses and procedures, such as the Norwegian Patient Registry, the Danish National Patient Registry or the Swedish National Patient Register. They may however lack information on lifestyle factors, patient-related outcomes and laboratory data. A Review of 103 Swedish Healthcare Quality Registries (J Intern Med 2015; 277(1): 94–136) describes additional healthcare quality registries focusing on specific disorders initiated in Sweden mostly by physicians with data on aspects of disease management, self-reported quality of life, lifestyle, and general health status, providing an important source for research.


4.3.5. Registries which capture special populations


Special populations can be identified based on age (e.g. paediatric or elderly), pregnancy status, renal or hepatic function, race, or genetic differences. Some registries are focused on these particular populations. Examples of these are the birth registries in Nordic countries and registries for rare diseases. The European Platform on Rare Diseases Registration (EU RD Platform) serves as a platform for information on registries for rare diseases and has developed a set of common data elements for the European Reference Network and other rare disease registries.


The FDA’s Draft Postapproval Pregnancy Safety Studies Guidance for Industry (May 2019) includes recommendations for designing a pregnancy registry with a description of research methods and elements to be addressed. The Systematic overview of data sources for Drug Safety in pregnancy research provides an inventory of pregnancy exposure registries and alternative data sources on safety of prenatal drug exposure and discusses their strengths and limitations. Examples of population-based registers that allow the outcomes of drug exposure during pregnancy to be assessed are the European network of registries for the epidemiologic surveillance of congenital anomalies (EUROCAT) and the pan-Nordic registries which record drug use during pregnancy, as illustrated in Selective serotonin reuptake inhibitors and venlafaxine in early pregnancy and risk of birth defects: population based cohort study and sibling design (BMJ 2015;350:h1798).


For paediatric populations, specific and detailed information such as neonatal age (e.g. in days), pharmacokinetic parameters and organ maturation needs to be considered. This information is usually missing from classical data sources, and paediatric-specific registries are therefore important. The CHMP Guideline on Conduct of Pharmacovigilance for Medicines Used by the Paediatric Population provides further relevant information. An example of a registry which focuses on paediatric patients is Pharmachild, which captures children with juvenile idiopathic arthritis undergoing treatment with methotrexate or biologic agents.

Other registries that focus on special populations (e.g., the UK Renal Registry) can be found in the ENCePP Inventory of data sources.

4.3.6. Disease registries in regulatory practice and health technology assessment


The article Patient Registries: An Underused Resource for Medicines Evaluation: Operational proposals for increasing the use of patient registries in regulatory assessments (Drug Saf. 2019;42(11):1343-1351) proposes sets of measures to improve use of registries in relation to: (1) nature of the data collected and registry quality assurance processes; (2) registry governance, informed consent, data protection and sharing; and (3) stakeholder communication and planning of benefit-risk assessments. Appendix 1 of Module VIII of the Good pharmacovigilance practice discusses the use of registries for conducting post authorisation studies. The use of registries to support the post-authorisation collection of data on effectiveness and safety of medicinal products in the routine treatment of diseases is also discussed in the EMA Scientific guidance on post-authorisation efficacy studies. Use of existing disease registries is recommended as they allow continued assessment of disease outcomes and a comparison of different treatment options using a similar methodology. Data of existing registries could be supplemented with additional data collection or linkage to external data sources.


When efficacy has been demonstrated in RCTs, registry-based studies may also be useful to study aspects related to long term effectiveness and safety in heterogeneous populations, study effect modifiers such as doses that have been prescribed by physicians and that may differ from those used in RCTs, and study patient sub-groups defined by variables such as age, co-morbidities, use of concomitant medication or genetic factors, or other factors that might influence effectiveness or safety. 


Incorporating data from clinical practice into the drug development process is of growing interest to health technology assessment (HTA) bodies and payers, since reimbursement decisions can benefit from better estimation and prediction of effectiveness of treatments at the time of product launch. An example of where registries can provide clinical practice data is the building of predictive models that incorporate data from both RCTs and registries to generalise results observed in RCTs to a real-world setting. In this context, the EUnetHTA Joint Action 3 project has issued the Registry Evaluation and Quality Standards Tool (REQueST), which aims to guide the evaluation of registries for effective usage in HTA.


4.4. Spontaneous reports


Spontaneous reports of adverse drug effects remain a cornerstone of pharmacovigilance and are collected from a variety of sources, including healthcare providers, national authorities, pharmaceutical companies, medical literature and more recently directly from patients. EudraVigilance is the European Union data processing network and management system for reporting and evaluation of suspected adverse drug reactions (ADRs). The Global Individual Case Safety Reports Database System (VigiBase) pools reports of suspected ADRs from the members of the WHO programme for international drug monitoring. These systems deal with the electronic exchange of Individual Case Safety Reports (ICSRs), the early detection of possible safety signals and the continuous monitoring and evaluation of potential safety issues in relation to reported ADRs. The report Characterization of databases (DB) used for signal detection (SD) of the PROTECT project shows the heterogeneity of spontaneous databases and the lack of comparability of SD methods employed. This heterogeneity is an important consideration when assessing the performance of SD algorithms.


The strength of spontaneous reporting systems is that they cover all types of legal drugs used in any setting. In addition to this, the reporting systems are built to obtain information specifically on potential adverse drug reactions and the data collection concentrates on variables relevant to this objective and directs reporters towards careful coding and communication of all aspects of an ADR. The increase in systematic collection of ICSRs in large electronic databases has allowed the application of data mining and statistical techniques for the detection of safety signals. There are known limitations of spontaneous ADR reporting systems, which include limitations embedded in the concept of voluntary reporting, whereby known or unknown external factors may influence the reporting rate and data quality. ICSRs may be limited in their utility by a lack of data for an accurate quantification of the frequency of events or the identification of possible risk factors for their occurrence. For these reasons, the concept is now well accepted that any signal from spontaneous reports needs to be verified clinically before further communication.


One challenge in spontaneous report databases is report duplication. Duplicates are separate and unlinked records that refer to one and the same case of a suspected ADR and may mislead clinical assessment or distort statistical screening. They are generally detected by individual case review of all reports or by computerised duplicate detection algorithms. In Performance of probabilistic method to detect duplicate individual case safety reports (Drug Saf 2014;37(4):249-58) a probabilistic method highlighted duplicates that had been missed by a rule-based method and also improved the accuracy of manual review. In the study, however, a demonstration of the performance of de-duplication methods to improve signal detection is lacking. The FDA have also implemented probabilistic duplicate detection in the FAERS and VAERS databases. A novel feature is an attempt to use narrative text analysed via NLP methods as demonstrated in Using Probabilistic Record Linkage of Structured and Unstructured Data to Identify Duplicate Cases in Spontaneous Adverse Event Reporting Systems (Drug Saf 2017;40(7):571–58).
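
To illustrate the general idea behind probabilistic duplicate detection, the following Python sketch scores pairs of reports with a simplified Fellegi-Sunter-style sum of log-likelihood ratios. It is not the method used in the cited publications or in any named database. The field names, the agreement probabilities and the threshold are hypothetical, and real values would have to be estimated from the database concerned.

```python
import math
from itertools import combinations

# Hypothetical (m, u) pairs per field:
#   m = P(field agrees | the two reports describe the same case)
#   u = P(field agrees | the two reports describe different cases)
FIELD_PARAMS = {
    "drug": (0.95, 0.05),
    "reaction": (0.90, 0.02),
    "sex": (0.98, 0.50),
    "age": (0.90, 0.03),
    "onset_date": (0.85, 0.01),
}


def match_score(a: dict, b: dict, params: dict = FIELD_PARAMS) -> float:
    """Sum of log2 likelihood ratios over the fields recorded in both reports."""
    score = 0.0
    for field, (m, u) in params.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a is None or value_b is None:
            continue  # missing information is treated as uninformative
        if value_a == value_b:
            score += math.log2(m / u)              # agreement raises the score
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return score


def candidate_duplicates(reports: list, threshold: float = 10.0) -> list:
    """Return (id_a, id_b, score) for each pair scoring at or above the threshold."""
    flagged = []
    for a, b in combinations(reports, 2):
        score = match_score(a, b)
        if score >= threshold:
            flagged.append((a["id"], b["id"], score))
    return flagged


if __name__ == "__main__":
    # Invented reports; "B" repeats "A" with the age field left empty.
    reports = [
        {"id": "A", "drug": "drug X", "reaction": "agranulocytosis",
         "sex": "F", "age": 54, "onset_date": "2019-03-02"},
        {"id": "B", "drug": "drug X", "reaction": "agranulocytosis",
         "sex": "F", "age": None, "onset_date": "2019-03-02"},
        {"id": "C", "drug": "drug Y", "reaction": "rash",
         "sex": "F", "age": 31, "onset_date": "2019-07-15"},
    ]
    for id_a, id_b, score in candidate_duplicates(reports):
        print(f"possible duplicate: {id_a} / {id_b} (score {score:.1f})")
```

Agreement on a rare value such as an onset date contributes far more than agreement on sex, which is the property that distinguishes this kind of scoring from a fixed rule. The sketch compares every pair of reports, which is only practical for small sets, and flagged pairs would still need individual case review.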


Validation of statistical signal detection procedures in EudraVigilance post-authorisation data: a retrospective evaluation of the potential for earlier signalling (Drug Saf 2010;33:475-87) has shown that the statistical methods applied in EudraVigilance can provide significantly earlier warning in a large proportion of drug safety problems. Nonetheless, this approach should supplement, rather than replace, other pharmacovigilance methods.
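
As an illustration of the kind of statistic used in such screening, the Python sketch below computes a proportional reporting ratio (PRR) with an approximate 95% confidence interval from a 2x2 table of report counts. The PRR is one widely used disproportionality measure; the text above does not specify which statistic or thresholds were evaluated, so the choice of measure, the function name and the counts are assumptions made for the example.

```python
import math


def proportional_reporting_ratio(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Return (PRR, lower, upper) from a 2x2 table of spontaneous report counts.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with all other drugs and the event of interest
    d: reports with all other drugs and any other event
    """
    if a <= 0 or c <= 0:
        raise ValueError("a and c must be positive for the PRR and its interval")
    if b < 0 or d < 0:
        raise ValueError("counts cannot be negative")
    # Share of the drug's reports that mention the event, relative to the
    # same share among reports for all other drugs.
    prr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(PRR), as for a ratio of two proportions.
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - z * se_log)
    upper = math.exp(math.log(prr) + z * se_log)
    return prr, lower, upper


if __name__ == "__main__":
    # Invented counts: 20 of 1000 reports for the drug mention the event,
    # against 100 of 99000 reports for all other drugs.
    prr, lower, upper = proportional_reporting_ratio(a=20, b=980, c=100, d=98900)
    print(f"PRR = {prr:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

A high PRR indicates only that the event is reported disproportionately often for the drug. It is not an estimate of risk, since spontaneous reports carry no denominator of exposed patients, which is why statistical screening supplements rather than replaces clinical review.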

Chapters IV and V of the Report of the CIOMS Working Group VIII ‘Practical aspects of Signal detection in Pharmacovigilance’ present sources and limitations of spontaneously-reported drug-safety information and databases that support signal detection. Appendix 3 of the report provides a list of international and national spontaneous reporting system databases.

4.5. Social media


4.5.1. Definition


Technological advances have dramatically increased the range of data sources that can be used to complement traditional ones and may provide compelling insights into effectiveness and safety of interventions. Such data include digital media that exist in a computer-readable format, such as websites, web pages, blogs, vlogs, social networking sites, internet forums, chat rooms and health portals. A recent addition to this list is the biomedical data collected through wearable technology (e.g. heart rate, physical activity, sleep pattern and dietary patterns). These data are unsolicited and generated in real time.


Social media is considered as a sub-set of digital media. The European Commission’s Digital Single Market Glossary defines social media as “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content. It employs mobile and web-based technologies to create highly interactive platforms via which individuals and communities share, co-create, discuss, and modify user-generated content.”


4.5.2. Use in pharmacovigilance


Social media has been used to provide insights into the patient’s perception of the effectiveness of drugs and for the collection of patient reported outcomes, as discussed in Web-based patient-reported outcomes in drug safety and risk management: challenges and opportunities? (Drug Saf 2012;35(6):437-46).


The IMI WEB-RADR European collaborative project explored different aspects related to the use of social media data as a basis for pharmacovigilance and summarised its recommendations in Recommendations for the Use of Social Media in Pharmacovigilance: Lessons From IMI WEB-RADR (Drug Saf 2019;42(12):1393-1407). The French Vigi4Med project, which evaluated the use of social media, mainly web forums, for pharmacovigilance activities, has published a set of recommendations in Use of Social Media for Pharmacovigilance Activities: Key Findings and Recommendations from the Vigi4Med Project (Drug Saf. 2020;10.1007/s40264-020-00951-2 [published online ahead of print, 2020 Jun 16]).


One possible use of social media would be as a source of information for signal detection or assessment. Studies including Using Social Media Data in Routine Pharmacovigilance: A Pilot Study to Identify Safety Signals and Patient Perspectives (Pharm Med 2017;31(3): 167-74) and Assessment of the Utility of Social Media for Broad-Ranging Statistical Signal Detection in Pharmacovigilance: Results from the WEB-RADR Project (Drug Saf 2018;41(12):1355–1369) have evaluated whether analysis of social media data (specifically Facebook and Twitter posts) could identify pharmacovigilance signals early but found that, in their respective settings, this was not the case.


Using Social Media Data in Routine Pharmacovigilance: A Pilot Study to Identify Safety Signals and Patient Perspectives (Pharm Med 2017;31(3): 167-74) also tried to determine the quantity of posts with resemblance to adverse events and the types and characteristics of products that would benefit from social media analysis. It concludes that, although analysis of data from social media did not identify new safety signals, it can provide unique insight into the patient perspective.

From a regulatory perspective, social media is a source of potential reports of suspected adverse drug reactions and marketing authorisation holders are legally obliged to screen web sites under their management and assess whether reports of adverse reactions qualify for spontaneous reporting (see Good Pharmacovigilance Practice Module VI (Rev. 2), Chapter VI.B.1.1.4). Principles for continuous monitoring of the safety of medicines without overburdening established pharmacovigilance systems and a regulatory framework on the use of social media in pharmacovigilance have been proposed in Establishing a Framework for the Use of Social Media in Pharmacovigilance in Europe (Drug Saf. 2019;42(8):921-30).

4.5.3. Challenges


While offering the promise of new research models and approaches, the rapidly evolving social media environment presents many challenges including the need for strong and systematic processes for selection, validation and study implementation. Articles which detail associated challenges are: Evaluating Social Media Networks in Medicines Safety Surveillance: Two Case Studies (Drug Saf 2015; 38(10): 921-30.) and Social media and pharmacovigilance: A review of the opportunities and challenges (Br J Clin Pharmacol 2015; 80(4): 910-20).

There is currently no defined strategy or framework in place to meet the standards for validity and generalisability of this type of data, and their regulatory acceptance may therefore be lower than for traditional sources. However, more tools and solutions for analysing unstructured data are becoming available, especially for pharmacoepidemiology and drug safety research, as in Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts (J Am Med Inform Assoc 2017 Feb 22) and Social Media Listening for Routine Post-Marketing Safety Surveillance (Drug Saf 2016;39(5):443-54). Nevertheless, the recognition and disambiguation of references to drugs and adverse events in free text remains a challenge, and performance evaluations need to be critically assessed, as discussed in Prospective Evaluation of Adverse Event Recognition Systems in Twitter: Results from the Web-RADR Project (Drug Saf 2020;10.1007/s40264-020-00942-3 [published online ahead of print, 2020 May 14]).

4.5.4. Data protection


The EU General Data Protection Regulation (GDPR) introduces EU-wide legislation on personal data and security. It specifies that the data protection impact should be assessed at the time of study design and reviewed periodically. Other technical documents may also be applicable such as Smartphone Secure Development Guidelines (2011) published by the European Network and Information Security Agency (ENISA), which advises on design and technical solutions. The principles of these security measures are found in the European Data Protection Supervisor (EDPS) opinion on mobile health (Opinion 1/2015 Mobile Health-Reconciling technological innovation with data protection).


4.6. Research networks for multi-database studies


4.6.1. General considerations


Pooling data across different databases affords insight into the generalisability of the results and may improve precision. A growing number of studies use data from networks of databases, often from different countries. Some of these networks are based on long-term contracts with selected partners and are very well structured (such as Sentinel, the Vaccine Safety Datalink (VSD) or the Canadian Network for Observational Drug Effect Studies (CNODES)), but others are looser collaborations based on an open community principle (e.g. Observational Health Data Sciences and Informatics (OHDSI)). In Europe, collaborations for multi-database studies have been strongly encouraged by the drug safety research funded by the European Commission (EC) and public-private partnerships such as the Innovative Medicines Initiative (IMI). This funding resulted in the conduct of groundwork necessary to overcome the hurdles of data sharing across countries for specific projects (e.g. PROTECT, ADVANCE, EMIF, EHDEN) or for specific post-authorisation studies.


In this chapter, networking is used to mean collaboration between investigators for sharing expertise and resources. The ENCePP Database of Research Resources may facilitate such networking by providing an inventory of research centres and data sources that can collaborate on specific pharmacoepidemiology and pharmacovigilance studies in Europe. It allows the identification of centres and data sets by country, type of research and other relevant fields.


The use of research networks in drug safety analyses is well established and a significant body of practical experience exists. By contrast, no consensus exists on the use of such networks, or indeed of single sources of observational data, in estimating effectiveness. In particular, the use in support of licensing applications will require evaluations of the reliability of results and the verifiability of research processes that are currently at an early stage. Specific advice on effectiveness can only be given once this work has been done and incorporated into regulatory guidelines. Hence this discussion currently relates only to product safety (see Assessing strength of evidence for regulatory decision making in licensing: What proof do we need for observational studies of effectiveness?; Pharmacoepidemiol. Drug Saf. 2020 Apr 16).

From a methodological point of view, research networks have many advantages over single database studies:

  • In case of primary data collection, shorten the time needed for obtaining the desired sample size and speed-up investigation of drug safety issues or other outcomes.


  • Benefit from the heterogeneity of treatment options across countries, which allows studying the effect of different drugs used for the same indication or specific patterns of utilisation.


  • May provide additional knowledge on the generalisability of results and on the consistency of information, for instance whether a safety issue exists in several countries. Possible inconsistencies might be caused by different biases or truly different effects in the databases revealing causes of differential drug effects, and these might be investigated.


  • Involve experts from various countries in addressing case definitions, terminologies, coding in databases and research practices, which provides opportunities to increase consistency of results of observational studies.


  • Allow pooling data or results and increase the amount of information gathered for a specific issue addressed in different databases.


The article Different strategies to execute multi-database studies for medicines surveillance in real world setting: a reflection on the European model (Clin. Pharmacol. Ther. 2020 Apr 3) describes different models applied for combining data or results from multiple databases. A common characteristic of all models is the fact that data partners maintain physical and operational control over electronic data in their existing environment and therefore the data extraction is always done locally. Differences however exist in the following areas: use of a common protocol; use of a common data model (CDM); and where and how the data analysis is done.

Use of a common data model (CDM) implies that local formats are translated into a predefined, common data structure, which allows launching a similar data extraction and analysis script across several databases. Sometimes the CDM imposes a common terminology as well, as in the case of the OMOP CDM. The CDM can be systematically applied on the entire database (generalised CDM) or on the subset of data needed for a specific study (study specific CDM). In the EU, study specific CDMs have generated results in several projects and studies and initial steps have been taken to create generalised CDMs, but experience based on real-life studies is still limited. An example is the study Safety of hydroxychloroquine, alone and in combination with azithromycin, in light of rapid wide-spread use for COVID-19: a multinational, network cohort and self-controlled case series study.
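The translation of local formats into a common structure can be sketched as follows. The two local record layouts, their field names and the `DrugExposure` structure are hypothetical and far simpler than any real CDM; the point is only that, once both sites expose the same structure, one analysis function can be applied to either without modification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugExposure:
    """Hypothetical common structure shared by all sites."""
    person_id: str
    drug_code: str
    start_date: str  # ISO 8601 date, YYYY-MM-DD


def from_site_a(row):
    """Site A (assumed layout): numeric patient id, ISO dates."""
    return DrugExposure(str(row["patid"]), row["atc"], row["rx_date"])


def from_site_b(row):
    """Site B (assumed layout): different field names, DD/MM/YYYY dates."""
    day, month, year = row["DispensingDate"].split("/")
    return DrugExposure(row["PatientNumber"], row["ATCCode"], f"{year}-{month}-{day}")


def count_users(exposures, drug_code):
    """Common script: number of distinct persons exposed to a given drug code."""
    return len({e.person_id for e in exposures if e.drug_code == drug_code})
```

In practice the mapping step also has to reconcile terminologies and units, which is where most of the conversion effort lies.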

4.6.2. Models of studies using multiple data sources


Five models of studies are presented, classified according to specific choices in the steps needed to execute a study: protocol development and agreement (whether separate or common); where the data are extracted and analysed (locally or centrally); how the data are extracted and analysed (using individual or common programs); and use of a CDM and which type (study specific or general) (see Table 1).

Meta-analysis: separate protocols, local and individual data extraction and analysis, no CDM


The traditional mode to combine data from multiple data sources is when data extraction and analysis are performed independently at each centre based on separate protocols. This is usually followed by meta-analysis of the different estimates obtained (see Chapter 5.7).
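Combining site-level estimates can be sketched with an inverse-variance fixed-effect pooling, together with Cochran's Q and the I² statistic as simple indicators of between-database heterogeneity. This is one standard technique among several and is not prescribed by the text above; the function name and input format are assumptions for this example.

```python
import math


def pool_fixed_effect(estimates):
    """Inverse-variance fixed-effect pooling of site-level estimates.

    estimates: list of (log_effect, standard_error) tuples, one per database,
               e.g. log hazard ratios with their standard errors.
    Returns (pooled_log_effect, pooled_standard_error, cochran_q, i_squared).
    """
    weights = [1 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    # Cochran's Q: weighted squared deviations of site estimates from the pooled value
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of total variation attributed to heterogeneity, floored at 0
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared
```

A high I² suggests that the databases disagree by more than chance alone would explain, in which case a single pooled estimate may hide informative differences and a random-effects model or a separate examination of each source may be more appropriate.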

This type of model may be viewed as a baseline situation which a research network will try to improve. Moreover, meta-analysis should be used in all models of studies presented, as there is always the possibility that different data sources provide different results, and hence explicitly looking for such variation should always be considered. If all the data sources can be accessed, explaining variations in terms of covariates should also be attempted. This is coherent with the recommendations from Multi-centre, multi-database studies with common protocols: lessons learnt from the IMI PROTECT project (Pharmacoepidemiol. Drug Saf. 2016;25(S1):156-165), which state that a priori pooling of data from several databases may disguise heterogeneity that may provide useful information on the safety issue under investigation. On the other hand, parallel analysis of databases allows exploring reasons for heterogeneity through extensive sensitivity analyses. This approach eventually increases consistency in findings from observational drug effect studies or reveals causes of differential drug effects.

Local analysis: common protocol, local and individual data extraction and analysis, no CDM


In this option, data are extracted and analysed locally, with site-specific programs that are developed by each centre, on the basis of a common protocol. Definitions of exposure, outcomes and covariates, analytical programmes and reporting formats are standardised according to a common protocol and the results of each analysis, either at a patient level or in an aggregated format depending on the governance of the network, are shared and pooled together through meta-analysis.

This approach allows assessment of database or population characteristics and their impact on estimates while reducing the variability of results caused by differences in design. Examples of research networks that use the common protocol approach are PROTECT (as described in Improving Consistency and Understanding of Discrepancies of Findings from Pharmacoepidemiological Studies: the IMI PROTECT Project (Pharmacoepidemiol Drug Saf 2016;25(S1): 1-165)) and the Canadian Network for Observational Drug Effect Studies (CNODES). The latter is experimenting with a CDM as explained in Building a framework for the evaluation of knowledge translation for the Canadian Network for Observational Drug Effect Studies (Pharmacoepidemiol. Drug Saf. 2020;29 (S1),8-25).

This approach requires very detailed common protocols and data specifications that reduce variability in interpretations by researchers.

Sharing of raw data: common protocol, local and individual data extraction, central analysis, no CDM


In this approach, a protocol is agreed by the study partners. Data intended to be used for the study are locally extracted with site-specific programs, transferred without analysis or conversion to a CDM, and pooled and analysed at the central partner receiving them.

This approach is applicable when databases are very similar in structure and content, as is the case for some Nordic registries or the Italian regional databases. Examples of such models are Selective serotonin reuptake inhibitors during pregnancy and risk of persistent pulmonary hypertension in the newborn: population based cohort study from the five Nordic Countries (BMJ 2012;344:d8012) and All‐cause mortality and antipsychotic use among elderly persons with high baseline cardiovascular and cerebrovascular risk: a multi‐center retrospective cohort study in Italy (Expert Opin. Drug Metab. Toxicol. 2019;15:179-88).

The central analysis removes an additional source of variability linked to the statistical programming and analysis.

Study specific CDM: common protocol, local and individual data extraction, local and common analysis, study specific CDM


In this approach, a protocol is agreed by the study partners and data intended to be used for the study are locally extracted and loaded into a CDM; data in the CDM are then processed locally in all the sites with one common program. The output of the common program is transferred to a specific partner. The output to be shared may be an analytical dataset or study estimates, depending on the governance of the network.

Examples of research networks that used this approach by employing a study-specific CDM with transmission of anonymised patient-level data (allowing a detailed characterisation of each database) are EU-ADR (as explained in Combining multiple healthcare databases for postmarketing drug and vaccine safety surveillance: why and how?, J Intern Med 2014;275(6):551-61), SOS, ARITMO, SAFEGUARD, GRIP, EMIF, EUROmediCAT and ADVANCE. In all these projects, a basic and simple CDM was utilised and R, SAS, STATA or Jerboa scripts have been used to create and share common analytics. Diagnosis codes for case finding can be mapped across terminologies by using CodeMapper, developed in the ADVANCE project, as explained in CodeMapper: semiautomatic coding of case definitions (Pharmacoepidemiol Drug Saf 2017;26(8):998-1005).

An approach to quantify the impact of different case finding algorithms, called the component strategy, was developed in the EMIF and ADVANCE projects and could also be compatible with the simple and generalised common data model (see Identifying Cases of Type 2 Diabetes in Heterogeneous Data Sources: Strategy from the EMIF Project. PLoS One 2016;11(8):e0160648).

General CDM: common protocol, local and common data extraction and analysis, general CDM


In this approach, the local databases are transformed into a CDM prior to and independent of any study protocol. When a study is required, a protocol is agreed by the study partners and a centrally developed analysis program is created that runs locally on each database to extract and analyse the data. The output of the common programs to be shared may be an analytical dataset or study estimates, depending on the governance of the network.
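The division of work between sites and the coordinating centre can be sketched as below. The row layout, field names and function names are hypothetical; the sketch assumes a governance model in which only aggregated outputs, and no patient-level records, leave each site.

```python
def run_locally(cdm_rows, drug_code):
    """Centrally developed program, executed unchanged at each site.

    cdm_rows: iterable of dicts in the (hypothetical) common format, each with
              'drug_code', 'had_outcome' (bool) and 'follow_up_years' (float).
    Returns aggregated counts only.
    """
    exposed = [row for row in cdm_rows if row["drug_code"] == drug_code]
    return {
        "n_exposed": len(exposed),
        "n_events": sum(1 for row in exposed if row["had_outcome"]),
        "person_years": sum(row["follow_up_years"] for row in exposed),
    }


def incidence_rates(site_outputs):
    """Coordinating centre: crude incidence rate per site from shared aggregates."""
    return {
        site: output["n_events"] / output["person_years"]
        for site, output in site_outputs.items()
        if output["person_years"] > 0
    }
```

Because the same program runs at every site, differences between the resulting site-level rates reflect the data and populations rather than differences in programming.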


Two examples of research networks which use a generalised CDM are the Sentinel Initiative (as described in The U.S. Food and Drug Administration's Mini-Sentinel Program, Pharmacoepidemiol Drug Saf 2012;21(S1):1–303) and OHDSI. The main advantage of a general CDM is that it can be used for virtually any study involving that database. OHDSI is based on the Observational Medical Outcomes Partnership (OMOP) CDM which is now used by many organisations and has been tested for its suitability for safety studies (see for example Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc 2012;19(1):54–60 and Can We Rely on Results From IQVIA Medical Research Data UK Converted to the Observational Medical Outcome Partnership Common Data Model?: A Validation Study Based on Prescribing Codeine in Children (Clin Pharmacol Ther 2020;107(4):915-25)). Conversion into the OMOP CDM requires formal mapping of database items to standardised concepts. This is resource intensive and will need to be updated every time the database is refreshed. An example of a study performed with the OMOP CDM in Europe is Safety of hydroxychloroquine, alone and in combination with azithromycin, in light of rapid wide-spread use for COVID-19: a multinational, network cohort and self-controlled case series study.


In A Comparative Assessment of Observational Medical Outcomes Partnership and Mini-Sentinel Common Data Models and Analytics: Implications for Active Drug Safety Surveillance (Drug Saf 2015;38(8):749-65), it is suggested that slight conceptual differences between the Sentinel and the OMOP models do not significantly affect the identification of known safety associations. Differences in risk estimations can be primarily attributed to the choices and implementation of the analytic approach.

Table 1: Models of studies using multiple data sources: key characteristics following the steps needed to execute a study




4.6.3. Challenges of different models


The models described above present several challenges:


Related to the scientific content

  • Differences in the underlying health care systems

  • Different mechanisms of data generation and collection

  • Mapping of differing disease coding systems (e.g., the International Classification of Disease, 10th Revision (ICD-10), Read codes, the International Classification of Primary Care (ICPC-2)) and narrative medical information in different languages

  • Validation of study variables and access to source documents for validation

Related to the organisation of the network

  • Differences in culture and experience between academia, public institutions and private partners

  • Differences in the type and quality of information contained within each mapped database

  • Different ethical and governance requirements in each country regarding processing of anonymised or pseudonymised healthcare data

  • Choice of data sharing model and access rights of partners

  • Issues linked to intellectual property and authorship

  • Sustainability and funding mechanisms

Each model has strengths and weaknesses in facing the above challenges, as illustrated in Data Extraction and Management in Networks of Observational Health Care Databases for Scientific Research: A Comparison of EU-ADR, OMOP, Mini-Sentinel and MATRICE Strategies (eGEMs 2016;4(1):2). In particular, a central analysis or a CDM provides protection from problems related to variation in how protocols are implemented, as individual analysts might implement protocols differently (as described in Quantifying how small variations in design elements affect risk in an incident cohort study in claims; Pharmacoepidemiol. Drug Saf. 2020;29(1):84-93). Experience has shown that many of these difficulties can be overcome by full involvement and good communication between partners, and a project agreement between network members defining roles and responsibilities and addressing issues of intellectual property and authorship. Several of the networks, such as OHDSI, Sentinel and ADVANCE, have made their code, products, data models and analytics software publicly available.

Timeliness in running studies is important in order to meet short regulatory timelines in circumstances where prompt decisions are needed. Solutions therefore need to be further developed and introduced so that multi-database studies can be run with shorter timelines. Independently of the model used, a major factor in speeding up studies is having preparatory work that does not depend on any particular study already done. This includes: prespecified agreements on data access and processes for protocol development and study management, identification and characterisation of a large set of databases, creation of common definitions for variables that seem likely to occur in studies, and common analytical systems in which the most typical and routine analyses are already defined (this last point is made easier by the use of CDMs, especially general ones, with standardised analytics and tools that can be re-used to support faster analysis).

