Responsible and evidence-based AI: 5 years on

“The answer to our current health crisis is AI; AI is better than doctors and we should be using it now.”

and

“AI is completely different, we’re not ready for it and it’s a Wild West out there…”

Do these statements sound familiar? Conversations around artificial intelligence (AI) in health care are often marked by a surprising level of confidence and polarisation. The way we talk about AI often reflects what we think it should be, rather than what the evidence shows. There could be many reasons for this, but three we regularly encounter are the vagueness and breadth of the AI promise, which permit each of us to construct our own assumptions around its opportunity or threat; the ubiquity of AI in media, fiction, and consumer electronics, which can give an unsubstantiated impression of capability, including in the context of health care; and a desperation for an easy solution to our health crisis, in which we reach for AI as the provider of low-cost, 24/7, high-quality health care, addressing the capacity–demand mismatch and reducing our dependence on those fallible, imperfect components of the health care system known as humans.

Encouragingly, our observation is that the conversation has moved on. The past 5 years have seen an increasing recognition that although AI is different, it is not exceptional: it does not sit outside the rule of evidence or the rule of law. During this time there has been a focus on ensuring that the gatekeepers to our health systems are equipped with the necessary knowledge and tools to critically evaluate AI health technologies, so that these technologies are aligned with benefiting patients, citizens, and society. So, what progress has been made? When it comes to AI health technologies, are we better at judging what good AI looks like now than we were 5 years ago?

As a community, we have decades of experience of evaluating health technologies. Good methodology looks remarkably similar whether one is evaluating an AI health technology or a non-AI health technology doing the same task. Elements of study design that minimise bias—including prespecifying endpoints, having a valid comparator group, and reporting the study transparently—still matter. However, although the foundations are unchanged, over the past 5 years we have steadily built an understanding of the extra elements required when evaluating AI health technologies, and these elements have progressively been codified into the various checkpoints, or gates, in the AI innovation pipeline, including journal mandates, regulatory requirements, and evaluation frameworks.

Publishing peer-reviewed studies of the development and evaluation of AI health technologies is a key step towards sharing evidence for these technologies in a robust, open, and transparent way. In 2019, we conducted a systematic review of AI diagnostics (specifically, deep-learning classifiers of medical images) and found that fewer than 1% of the 31 587 papers we evaluated could inform the question of whether AI could match human performance at the same task; the remainder had to be excluded because of design or reporting limitations associated with potential bias.1 This study did two things: first, it highlighted that the evidence base for AI performance was much weaker than most believed; and second, it suggested that gatekeepers (in this case, journal editors and peer-reviewers) were, at the time, struggling to maintain standards when confronted with an influx of AI studies, many of which appeared to be groundbreaking.

Journals have a crucial role in setting standards for the design and transparent reporting of studies, and have actively engaged in the cross-sector development of AI extensions to key reporting guidelines. These guidelines (eg, CONSORT-AI and others2,3) provide innovators, peer-reviewers, and editors with a shared standard to work to. Additionally, these guidelines have increasingly been referenced by national and international policy bodies and regulators.4,5 An essential component of these guidelines is that, in addition to appropriately addressing AI-specific elements (eg, describing the AI input data, reporting the algorithm version used, and reporting algorithmic errors), they also include standard elements of good design and reporting, thereby encouraging a holistic, balanced view of the evidence—neither overlooking the AI elements nor ignoring everything else.

Although our evidence-based principles remain foundational, AI health technologies do have inherent characteristics that create new opportunities and risks. All gatekeepers—journal editors, regulators, payers, health professionals, and others—need to understand the implications of these opportunities and risks for the evaluation, regulation, procurement, and implementation of AI health technologies.

One opportunity is the ability to iterate and improve AI technologies, through technical modifications and updates, to optimise model performance. These technical capabilities require new regulatory approaches, with the US Food and Drug Administration (FDA), Health Canada, and the UK’s Medicines and Healthcare products Regulatory Agency (MHRA) introducing Predetermined Change Control Plans to allow manufacturers to update models within a prespecified scope without requiring a new regulatory submission. Like journals, regulators have sought to take an aligned approach, jointly advocating for best-practice principles. In 2021, the FDA, Health Canada, and the MHRA published ten guiding principles of good machine learning practice for medical device development.6
A particularly challenging area is how to bridge the generalisability gap, where many AI models seem to fail. Bridging this gap necessitates local assurance, silent trials (evaluations of the AI model within the intended workflow in which its outputs are not acted upon) to establish baseline performance, ongoing algorithmic auditing, and effective post-market surveillance to ensure that AI technologies continue to perform as expected, both across the market as a whole and within individual settings. The upfront investment needed to provide the resources and infrastructure for this level of monitoring is, in our experience, often overlooked. The burden currently falls on the manufacturer as the responsible entity with post-market surveillance obligations, but building this expertise and capability is necessary for any responsible health institution wanting to implement AI, and it is an area we should be focusing on in the coming years. Health institutions without this capability should adopt a cautious approach, and implementing AI in such settings should be considered at risk.
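To make the silent-trial concept more concrete, the sketch below shows, in Python, one minimal way a health institution might log model outputs without acting on them and later estimate local baseline performance once reference labels become available. The names used (SilentTrialRecord, run_silently, baseline_performance, predict_fn) are illustrative assumptions of ours, not part of any specific product, framework, or regulation.

```python
# Minimal sketch of silent-trial logging: the model is scored alongside the
# live workflow, its outputs are recorded but never returned to clinicians,
# and local performance is estimated retrospectively against the reference
# standard once it is known. All names here are hypothetical.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SilentTrialRecord:
    case_id: str
    model_output: int                       # e.g. 1 = condition flagged, 0 = not flagged
    reference_label: Optional[int] = None   # filled in once the clinical ground truth is known


def run_silently(case_id: str, inputs, predict_fn: Callable,
                 log: List[SilentTrialRecord]) -> None:
    """Score a case and log the result; the output is NOT acted upon clinically."""
    log.append(SilentTrialRecord(case_id=case_id, model_output=int(predict_fn(inputs))))


def baseline_performance(log: List[SilentTrialRecord]) -> dict:
    """Estimate local sensitivity and specificity from labelled silent-trial records."""
    labelled = [r for r in log if r.reference_label is not None]
    tp = sum(r.model_output == 1 and r.reference_label == 1 for r in labelled)
    fn = sum(r.model_output == 0 and r.reference_label == 1 for r in labelled)
    tn = sum(r.model_output == 0 and r.reference_label == 0 for r in labelled)
    fp = sum(r.model_output == 1 and r.reference_label == 0 for r in labelled)
    return {
        "n_labelled": len(labelled),
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }
```

In principle, the same logged records could also feed ongoing algorithmic audits and post-market surveillance, for example by recomputing these metrics over rolling time windows or within patient subgroups, although the operational detail will depend on the setting and the manufacturer's obligations.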

In the past 5 years, the risk of bias and the need to address equity considerations and societal harms have received increasing attention. As is often the case with issues of equity and inclusion, recognising the problem is easier than describing a solution, but recognition has highlighted the need to ensure that technologies work inclusively and has led to this requirement being increasingly embedded in policy documents.7–9 As with many other AI-related risks, neither the introduction of bias nor the role of data in introducing it is unique to AI; the concern is the scale at which bias could be introduced invisibly and at a systemic level. However, there are encouraging signs that, by observing bias in AI, we can better understand existing areas of inequity in our practice and provision of health care.10

The challenge for our gatekeepers is that innovators continue to innovate rapidly, producing new technological breakthroughs. We have found that considerable progress has been made among regulators and other gatekeepers in responding to the challenges and opportunities of narrowly defined applications of AI in specific diseases, such as diabetic retinopathy or mammography for breast cancer. We believe that if the gatekeepers delivered on their strategy documents and roadmaps, we would be in a strong position to effectively evaluate and regulate such technologies. Although the strategic intent is coherent and comprehensive, these changes cannot be implemented without resourcing across the entire ecosystem of AI in health care.

With the advent of generative AI, foundation models, and large language models, general-purpose models are blurring the lines between what we currently consider to be medical and non-medical applications. These newer AI technologies are challenging the assumptions on which our current systems are based, and do not fit neatly into the boundaries created by existing rules and regulations that allow our ecosystem to function. How should we regulate a medical large language model? How do we test something that can have infinite different inputs, especially when even the same input can yield different outputs at different times?
Can we answer these questions by continuing to evolve the existing medical device regulatory framework, or are new approaches needed? It is probably too early to tell, but we would argue that the foundational principles of evidence-based medicine and responsible innovation remain the same. We need to remember that, however exciting the technology, this is first about people and only second about products. Our approach to evaluation and regulation needs to remain robust, patient-centred, and evidence-driven as we work together to ensure that patients can benefit from products that are safe, effective, equitable, and sustainable.

AKD receives funding from the National Institute of Health and Care Research (NIHR), including through the NIHR Birmingham Biomedical Research Centre (BRC), and is employed by University Hospitals Birmingham National Health Service (NHS) Foundation Trust. XL receives funding from the NIHR, including through the NIHR Birmingham BRC, the NHS AI Laboratory, and The Health Foundation. XL is employed by University Hospitals Birmingham NHS Foundation Trust, was previously a health scientist at Apple, and before that received consulting fees from Hardian Health. The views expressed are those of the authors and not necessarily those of the NIHR or the UK Government’s Department of Health and Social Care. AKD and XL lead the NIHR-funded Incubator in AI and Digital Healthcare.
