16.5.3. Explainable AI
Artificial intelligence (AI) is a catch-all term for a set of tools and techniques that allow machines to perform activities commonly described as requiring human-level intelligence. While no consensus on a definition of AI exists, definitions commonly draw an analogy to human intelligence. This analogy is unhelpful, as it suggests Artificial General Intelligence, whereas current techniques and tools are dedicated to assisting with specific tasks, i.e., Artificial Narrow Intelligence.
Machine Learning (ML) is considered a subset of AI and reflects the ability of computers to identify and extract rules from data rather than those rules being explicitly coded by a human. Deep Learning (DL) is a subtype of ML that parses and analyses data in a more complex way. The rules identified by ML or DL applications constitute an algorithm, and the outputs are often said to be data-driven, as opposed to rules explicitly coded by a human, which form knowledge-based algorithms.
Natural language processing (NLP) sits at the interface of linguistics, computer science and AI and is concerned with providing machines with the ability to understand text and spoken words. NLP can be divided into statistical NLP, which uses ML or DL approaches, and symbolic NLP, which uses a semantic rule-based methodology. Applications of AI in pharmacoepidemiology can be broadly classified into those that extract and structure data and those that produce insights.
Data extraction
AI techniques can be used to extract text data from unstructured documents, transforming it into information available in a structured, research-ready format to which statistical techniques can be applied. A potential application being explored is the extraction of data from medical notes. This usually includes named-entity recognition, i.e., discovering mentions of entities of a specific class or group, such as medications or diseases, and relation extraction, which relates sets of entities, e.g., a medicine and an indication.
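The two steps can be illustrated with a minimal symbolic (rule-based) sketch in Python. It is not taken from any of the systems cited in this chapter: the lexicons, the example note, the relation label and the "for" linking rule are all hypothetical simplifications, and real applications would typically rely on curated vocabularies and statistical models.

```python
import re

# Hypothetical toy lexicons; a real system would use a curated vocabulary.
MEDICATIONS = {"warfarin", "metformin"}
DISEASES = {"atrial fibrillation", "type 2 diabetes"}


def find_entities(text):
    """Named-entity recognition: return (start, end, label, term) per lexicon match."""
    entities = []
    for label, lexicon in (("MEDICATION", MEDICATIONS), ("DISEASE", DISEASES)):
        for term in lexicon:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, re.IGNORECASE):
                entities.append((match.start(), match.end(), label, match.group(0).lower()))
    return sorted(entities)


def extract_indications(text):
    """Relation extraction: link a medication to a disease that follows it in
    the same sentence when the word 'for' appears between the two mentions."""
    relations = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        entities = find_entities(sentence)
        medications = [e for e in entities if e[2] == "MEDICATION"]
        diseases = [e for e in entities if e[2] == "DISEASE"]
        for med in medications:
            for dis in diseases:
                if med[1] <= dis[0] and re.search(r"\bfor\b", sentence[med[1]:dis[0]].lower()):
                    relations.append((med[3], "indicated_for", dis[3]))
    return relations


# Hypothetical free-text note, for illustration only.
note = "Patient started on warfarin for atrial fibrillation. Continues metformin for type 2 diabetes."
relations = extract_indications(note)
print(relations)
```

The output is a list of structured (medicine, relation, indication) triples, the kind of research-ready format to which statistical techniques can then be applied.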
The 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text (J Am Med Inform Assoc. 2011;18(5):552-6) presents three tasks: a concept extraction task focused on extracting medical concepts from patient reports; a classification task focused on assigning assertion types for medical problem concepts; and a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. Multiple algorithms were compared, showing promising results for concept extraction. In NEAR: Named entity and attribute recognition of clinical concepts (J Biomed Inform. 2022;130:104092), three DL models were created for the same data used in the 2010 i2b2 challenge and showed an improvement in performance.
Some of the first applications of ML and NLP to extract information from clinical notes focused on the identification of adverse drug events in medical notes, as illustrated in publications such as A method for systematic discovery of adverse drug events from clinical notes (J Am Med Inform Assoc. 2015;22(6):1196-204), Detecting Adverse Drug Events with Rapidly Trained Classification Models (Drug Saf. 2019;42(1):147-56) and MADEx: A System for Detecting Medications, Adverse Drug Events, and Their Relations from Clinical Notes (Drug Saf. 2019;42(1):123-33).
Another common application of medical concept extraction from clinical text is the identification of a relevant set of patients, often referred to as computable phenotyping, as exemplified in Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications (J Am Med Inform Assoc. 2010;17(5):507-13). Combining deep learning with token selection for patient phenotyping from electronic health records (Sci Rep. 2020;10(1):1432) describes the development of DL models to construct a computable phenotype directly from medical notes.
A large body of research has focused on extracting information from clinical notes in electronic health records. The approach can also be applied with some adjustment to other sets of unstructured data, including spontaneous reporting systems, as reflected in Identifying risks areas related to medication administrations - text mining analysis using free-text descriptions of incident reports (BMC Health Serv Res. 2019;19(1):791), product information documentation such as presented in Machine learning-based identification and rule-based normalization of adverse drug reactions in drug labels (BMC Bioinformatics. 2019;20(Suppl. 21):707) or even literature screening for systematic reviews as explored in Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool (Syst Rev. 2018 Mar 12;7(1):45).
In the systematic review Use of unstructured text in prognostic clinical prediction models: a systematic review (J Am Med Inform Assoc. 2022 Apr 27;ocac058), data extraction from unstructured text was shown to be beneficial in most studies. However, data extraction from unstructured text is not perfectly accurate, whichever performance metric is used, and model performance may vary widely for the same data extraction task, as shown in ADE Eval: An Evaluation of Text Processing Systems for Adverse Event Extraction from Drug Labels for Pharmacovigilance (Drug Saf. 2021;44(1):83-94). Thus, the application of these techniques should consider whether the objective calls for precision or recall. For instance, a model that identifies medical concepts in a spontaneous report of an adverse drug reaction from a patient and maps them to a medical vocabulary might preferably focus on achieving a high recall, as false positives can be picked up in the manual review of the potential signal, whereas models with high precision and low recall may introduce irretrievable loss of information. In other words, ML models to extract data are likely to introduce some error, and thus the error tolerance for the specific application needs to be considered.
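The trade-off between precision and recall can be made concrete with a small worked example. The sketch below assumes a hypothetical reference set of adverse reaction terms for a single report and two hypothetical extraction models. Precision is the share of extracted terms that are correct, and recall is the share of reference terms that were found.

```python
def precision_recall(extracted, reference):
    """Compare a set of extracted concepts with a reference (gold-standard) set."""
    extracted, reference = set(extracted), set(reference)
    true_pos = len(extracted & reference)   # correctly extracted concepts
    false_pos = len(extracted - reference)  # extracted but not in the reference
    false_neg = len(reference - extracted)  # in the reference but missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical terms, for illustration only.
reference = {"nausea", "rash", "headache", "dizziness"}
high_recall_model = {"nausea", "rash", "headache", "dizziness", "fatigue", "cough"}
high_precision_model = {"nausea", "rash"}

print(precision_recall(high_recall_model, reference))
print(precision_recall(high_precision_model, reference))
```

In this toy case the first model finds every reference term at the cost of two false positives, which a manual review could discard. The second model makes no false extractions but misses half of the reference terms, and that information cannot be recovered downstream.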
Data insights
In pharmacoepidemiology, data insights extracted with ML models typically fall into one of three categories: confounding control, clinical prediction models and probabilistic phenotyping.
Propensity score methods are a predominant technique for confounding control (see the chapter on propensity score methods). In practice, the propensity score is most often estimated using a logistic regression model, in which treatment status is regressed on observed baseline characteristics. In Evaluating large-scale propensity score performance through real-world and synthetic data experiments (Int J Epidemiol. 2018;47(6):2005-14) and A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting (Biom J. 2019;61(4):1049-72), ML models were explored as alternatives to traditional logistic regression with a view to improving propensity score estimation. The theoretical advantages of using ML models include automation, by dispensing with the need for investigator-defined covariate selection, and better modelling of non-linear effects and interactions. However, most studies in this field use synthetic or plasmode data, and applications in real-world data need to be further explored.
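The conventional approach, in which treatment status is regressed on baseline characteristics, can be sketched as follows. This is a minimal illustration using only the Python standard library, with a plain gradient-ascent fit instead of a statistical package. The two covariates (a standardised age and a comorbidity flag), the data and the tuning values are hypothetical. An ML-based alternative would replace the fitting function with another learner that outputs a treatment probability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, t, learning_rate=0.5, n_iter=5000):
    """Fit treatment ~ covariates by gradient ascent on the mean log-likelihood.
    Returns [intercept, coefficient_1, ..., coefficient_k]."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, treated in zip(X, t):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            residual = treated - p
            grad[0] += residual
            for j, x in enumerate(row):
                grad[j + 1] += residual * x
        beta = [b + learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    """Estimated probability of treatment given the covariates."""
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row))) for row in X]


# Hypothetical baseline data: (standardised age, comorbidity flag) per patient.
covariates = [
    (0.0, 0), (0.0, 0), (1.0, 1), (1.0, 1), (-1.0, 1), (-1.0, 1),
    (1.5, 1), (2.0, 0), (-1.5, 0), (-2.0, 0), (0.5, 1), (-0.5, 0),
]
treated = [0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]

beta = fit_logistic(covariates, treated)
scores = propensity_scores(covariates, beta)
for row, t, ps in zip(covariates, treated, scores):
    print(row, t, round(ps, 3))
```

The estimated scores could then be used for matching, weighting or stratification. Covariate selection here is entirely investigator-defined, which is the step that ML approaches aim to automate.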
The concept of rule-based, knowledge-based algorithms and risk-based stratification is not new to medicine and healthcare, the Framingham risk score being one of the best known. Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review (J Am Med Inform Assoc. 2022;29(5):983-9) shows a growing trend towards developing data-driven clinical prediction models. However, the problem definition is often not clearly reported, and the final model is often not completely presented. This trend was exacerbated during the COVID-19 pandemic, when over 400 papers on clinical prediction models were published (see Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal, BMJ. 2020;369:m1328). The authors of that review also suggest that prediction models are poorly reported and at high risk of bias, such that their reported predictive performance is probably optimistic, which was confirmed for several models in Clinical prediction models for mortality in patients with covid-19: external validation and individual participant data meta-analysis (BMJ. 2022;378:e069881). Such optimism is common, as reported in External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination (Journal of Clinical Epidemiology. 2015;68(1):25–34).
While reporting guidelines specific to AI prediction models are still under development (Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence, BMJ Open 2021;11:e048008), the Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement can be used (BMJ 2015;350:g7594). Further, PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies (Ann Intern Med. 2019;170:51-58) supports the evaluation of prediction models. A review of checklists for reporting AI use is provided in Time to start using checklists for reporting artificial intelligence in health care and biomedical research: a rapid review of available tools (2022 IEEE 26th International Conference on Intelligent Engineering Systems (INES), IEEE 2022. p. 000015–20). A checklist for assessing bias in an ML algorithm is provided in A clinician's guide to understanding and critically appraising machine learning studies: a checklist for Ruling Out Bias Using Standard Tools in Machine Learning (ROBUST-ML) (European Heart Journal - Digital Health. 2022;3(2):125–40).
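Discrimination, which the external validation studies cited above found to be worse than originally reported, is commonly summarised by the c-statistic: the proportion of (event, non-event) pairs in which the patient with the event received the higher predicted risk. The sketch below computes it for a development set and an external validation set. All predicted risks and outcomes are hypothetical and chosen only to illustrate how apparent performance can drop on new data.

```python
def c_statistic(predicted, observed):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted risk; tied predictions count as one half."""
    events = [p for p, y in zip(predicted, observed) if y == 1]
    non_events = [p for p, y in zip(predicted, observed) if y == 0]
    concordant = 0.0
    for risk_event in events:
        for risk_non_event in non_events:
            if risk_event > risk_non_event:
                concordant += 1.0
            elif risk_event == risk_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical predicted risks and observed outcomes, for illustration only.
development = ([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0])
external = ([0.9, 0.4, 0.6, 0.7, 0.2, 0.5], [1, 1, 0, 0, 0, 1])

c_development = c_statistic(*development)
c_external = c_statistic(*external)
print(round(c_development, 3), round(c_external, 3))
```

A value of 1 indicates perfect separation of events from non-events and 0.5 indicates no better than chance. Reporting the statistic for external data as well as development data helps readers judge how optimistic the development estimate is.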
Clinical prediction models have also been applied for safety signal detection with some degree of success as exemplified in A supervised adverse drug reaction signalling framework imitating Bradford Hill's causality considerations (J Biomed Inform. 2015;56:356-68). For the evaluation of safety and utility, the Reporting guideline for the early stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI can be used (BMJ 2022;377:e070904).
Probabilistic phenotyping is another potential use of ML in pharmacoepidemiology. It refers to the development of a case definition in which a set of labelled examples is used to train a model that outputs the probability of a phenotype as a continuous trait. It differs from the ML-based computable phenotyping mentioned earlier: probabilistic phenotyping takes a set of features and estimates the probability of a phenotype, whereas in computable phenotyping the ML technique merely extracts information that identifies a relevant case.
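The distinction can be sketched in a few lines of Python. Everything below is hypothetical: the target condition, the feature names, and the intercept and weights, which stand in for a model that would in practice be trained on labelled examples.

```python
import math


def computable_phenotype(extracted_concepts):
    """Binary case definition: a patient is a case if the target concept
    was extracted from their notes."""
    return "heart failure" in extracted_concepts


# Hypothetical coefficients standing in for a model trained on labelled examples.
INTERCEPT = -3.0
WEIGHTS = {"diagnosis_code": 2.0, "loop_diuretic": 1.0, "elevated_bnp": 1.5}


def probabilistic_phenotype(features):
    """Continuous case definition: probability of the phenotype given features."""
    score = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


# Two hypothetical patients.
patient_a = {"diagnosis_code": 1, "loop_diuretic": 1, "elevated_bnp": 1}
patient_b = {"diagnosis_code": 0, "loop_diuretic": 1, "elevated_bnp": 0}

print(computable_phenotype({"heart failure", "hypertension"}))
print(round(probabilistic_phenotype(patient_a), 3))
print(round(probabilistic_phenotype(patient_b), 3))
```

The first function yields a yes/no classification, while the second yields a probability that can be thresholded at different levels depending on whether a sensitive or a specific case definition is needed.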
Methods for diagnosis phenotyping are discussed in Methods for Clinical Evaluation of Artificial Intelligence Algorithms for Medical Diagnosis (Radiology. 2023 Jan;306(1):20–31). Validation of outcome phenotyping in pharmacoepidemiology, although not specifically AI-related, is discussed in Core concepts in pharmacoepidemiology: Validation of health outcomes of interest within real-world healthcare databases (Pharmacoepidemiology and Drug Safety. 2023;32(1):1–8).
Identifying who has long COVID in the USA: a machine learning approach using N3C data (Lancet Digit Health. 2022;S2589-7500(22)00048-6) describes the development of a probabilistic phenotype of patients with long COVID using ML models, which showed high accuracy. Probabilistic phenotyping can be applied in wider contexts. In An Application of Machine Learning in Pharmacovigilance: Estimating Likely Patient Genotype From Phenotypical Manifestations of Fluoropyrimidine Toxicity (Clin Pharmacol Ther. 2020;107(4):944–7), an ML model using clinical manifestations of adverse drug reactions is used to estimate the probability of having a specific genotype known to be correlated with severe but varied outcomes.
As the development of probabilistic phenotypes is likely to increase, tools to assess their performance characteristics, such as PheValuator: Development and evaluation of a phenotype algorithm evaluator (J Biomed Inform. 2019;97:103258), become more relevant.
Another possible category of use is hypothesis generation in causal inference, but this requires further research. For instance, in Identifying Drug-Drug Interactions by Data Mining: A Pilot Study of Warfarin-Associated Drug Interactions (Circ Cardiovasc Qual Outcomes. 2016;9(6):621-628) known warfarin–drug interactions and unknown possible interactions were identified using random forests.
AI decisions, predictions, extractions and other outputs can be incorrect, sometimes especially so for a subgroup of people, and can therefore give rise to risks and ethical concerns that must be investigated. As deep learning models are not directly interpretable, methods to explain their decisions have been developed. However, these methods provide only an approximation that might not resemble the underlying model, and their performance is rarely tested.
In The false hope of current approaches to explainable artificial intelligence in health care (Lancet Digit Health. 2021;3(11):e745-e750), the authors show that incorrect explanations from current explainability methods can cause problems for decision making for individual patients, and they explain that these explainable AI methods are unlikely to achieve their asserted goals for patient-level decision support.
In Artificial intelligence in pharmacovigilance: A regulatory perspective on explainability (Pharmacoepidemiol Drug Saf. 2022;31(12):1308-1310) the authors argue that although by default pharmacovigilance models should require explainability, model performance may outweigh explainability in processes with high error-tolerance where, for instance, a human-in-the-loop is required, and the need for explainability should follow a risk-based approach.