The age of big data might be driven not only by those who master the collection and management of data, but also by those who can create it.

Introduction

In this article I’d like to explore a way to solve the data problem in medical imaging AI: that is, the lack of data. I’m going to focus on the UK, where I work, but the conclusions are relevant to any healthcare system.

What’s the data problem?

In medical imaging projects, particularly those that involve AI, there is a requirement for imaging data. Obtaining this data is a significant challenge for businesses in medical imaging. It is in everyone’s collective interest to make this data available: it drives innovation and ultimately improves patient care, which is what we should all care about most.

It isn’t clear what the true source of the issue is; it is likely a combination of factors. I don’t claim to be an expert, but I will share some thoughts.

What is clear is that the NHS produces a lot of data. Petabytes are not only collected, but controlled and stored in such a way that internal research projects within the health service can access them; see EMRAD [1] for a really good example of this in action, or the OPTIMAM [2] database created by the Surrey NHS Trust.

What is also clear is that much attention has been paid to the idea of making data within the NHS available to drive innovation, and the willpower is there. The UK government have quite clearly stated their intentions through the Industrial Strategy Grand Challenges, including setting up NHSX [3].

However, a combination of factors makes the NHS reluctant to share data with companies.

Sure, large medical companies like GE or Siemens have ready access to data collected by their equipment, and sure, some smaller companies have managed to accrue large datasets, which understandably remain firmly proprietary.

If you are a researcher at a university you have access to some large datasets for free [4], but you are at the mercy of those who curated them and will almost never find one that fits your specific requirements (most are from the USA or China, and often target a very specific disease or pathology).

In this environment, the smaller and traditionally more innovative companies (startups and SMEs) find their innovations curtailed by the data access issue. But it affects research institutes and large companies as well.

This problem has several contributory causes.

  • Regulations for data sharing don’t exist at a national level; it is handled by individual Trusts (of which there are many, all governed independently). Some larger consortia of Trusts have begun pooling data, but no national framework for data sharing exists.

    Even when this route is an option, the ethics approval process is lengthy and costly, and there is no quick turnaround of results on a project. Various certifications may be required, which are themselves costly and time-consuming to acquire.

“There are multiple regulators involved, creating a bewildering array of bodies for innovators to navigate and creating confusion for organisations in the NHS and social care who want to make the most of these innovations.”

Matthew Gould, CEO of NHSX
  • The public perception of a hospital selling or distributing images of its patients is fairly negative; add a multiplier factor when you mention AI. The people working in the medical industry care just as much about patient care as those within the healthcare system, but the commercial world is often viewed more sceptically. The health system serves the public, whose interests must come first, so public perception provides a very high-inertia barrier which may take a long time to change.
  • The financial purchasing power of smaller companies is not enough to make the laborious task of supplying data worthwhile for hospitals. Pricing strategies for commercialising data can account for this, of course, once they exist. The legwork involved for the NHS to put itself in a position to distribute data is huge.

So, who knows which of these possible explanations is most significant. Perhaps there just isn’t the impetus yet in the NHS to share data in a more open commercial environment. The motivators are there, but they are yet to have a significant effect.

But we shouldn’t give up.

Is synthetic data the solution?

Although it’s reassuring to know that bright minds are working on the data issue and the government has made it part of a major healthcare strategy, the situation here and now is that data access is very limited for companies outside the billion-dollar revenue category.

This leaves many questioning the cost/benefit of this “gold standard” data. Perhaps there is a way to solve the data problem without the healthcare system giving away real patient data?

Digital phantoms are digital representations of human anatomy that include a level of detail appropriate to a specific application. The field of digital phantoms is quite advanced [5] and includes detailed models of all organs, bones and tissues, at almost all scales down to nanometres.

The image at the start of this article is a synthetic phantom created and simulated with a toolkit I am currently developing.

This data is truly anonymised: it has no relation to a real person. Nonetheless, it can be made to look exactly like a real patient using robust statistical techniques.

Using digital phantoms and a simulation toolkit, it is possible to reproduce very accurately the imaging conditions seen in clinical practice. You wouldn’t notice the difference between a simulated X-ray and a real one.

The simulation and phantom technology is advanced enough that even experts with 40 years’ experience in analysing imagery cannot tell the difference (see [6] for a spectacular example).

Rapid advances in computing technology, particularly GPUs, enable simulations to be performed extremely quickly. Simulations are inherently highly parallelisable (each photon can be traced through the patient independently of all the rest), which plays directly to the key strength of a GPU. In under a minute, we can simulate 10 billion photons passing through a large, high-resolution CT volume of a billion voxels.
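
To make that per-photon parallelism concrete, here is a deliberately tiny sketch in Python/NumPy. It is not the toolkit mentioned above: the phantom is random, the beam is parallel, scatter is ignored, and all numbers are invented for illustration. The point is that every pixel (and every photon) is computed independently, which is exactly why the workload maps so well onto GPU threads; a production Monte Carlo code would additionally trace scattered photons one by one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 256^3 phantom of linear attenuation coefficients (1/mm);
# a real phantom would come from the anatomical models discussed above.
mu = rng.uniform(0.0, 0.05, size=(256, 256, 256)).astype(np.float32)
voxel_mm = 1.0

# Parallel-beam projection along z: the attenuation line integral for every
# detector pixel is independent of every other pixel (just as every photon
# is independent of every other photon), which is what lets the workload
# spread across thousands of GPU threads.
line_integrals = mu.sum(axis=2) * voxel_mm          # shape (256, 256)

# Beer-Lambert law gives the expected surviving photon count per pixel;
# Poisson sampling adds realistic quantum (photon-counting) noise.
photons_per_pixel = 10_000
expected_counts = photons_per_pixel * np.exp(-line_integrals)
radiograph = rng.poisson(expected_counts).astype(np.float32)

print(radiograph.shape, radiograph.mean())
```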

Let’s take this idea further: I believe synthetic data can outcompete a real dataset in almost every way.

  • Data volume? No problem: rapid simulations can be scaled up with as many GPUs as you need, and millions of images can be produced. If generation is quick enough, real-time data generation allows data to be created at the same rate as it is consumed by the algorithm being trained (see the data-generator sketch after this list).
  • Cost? There is no overhead or cost in collecting and anonymising the data. Instead, the costs are in computational hardware and software maintenance, which are significantly lower.
  • What about the occurrence of pathologies like cancer? Surely simulations can’t have accurate disease pathology. Well… this is a challenge of course, but it can be solved [7]. Each pathology requires an independent study to establish a good-quality model, but once that’s done it is simple to include it in the data. Not only that, but rarer presentations can be oversampled in the dataset to correct for any algorithmic bias against less likely diagnoses.
  • Annotation? Most data used for a specific purpose must be annotated, i.e. labelled in some way by an expert. Synthetic data can be automatically labelled as part of the simulation process for many tasks, and the annotations should be 100% reliable (see the data-generator sketch after this list).
  • Validation? A digital clinical trial, perhaps… The idea of completely digital validation of a clinical product is not new [7], and results have been very promising. Of course, I am not advocating letting loose a product that has never been tested on real patient data, but think of the cost and efficiency savings of being able to run a full clinical trial in-house as part of a company’s internal quality assurance.
    By the time a product reaches clinical trials, if it has already been validated through a digital clinical trial, this could give a much higher success rate. A digital trial gives the developers time to identify and fix problems before investing their own and clinicians’ time in validating against real data.
  • Accurate population variation? This one is a tick for the simulation methodology as well. Techniques like statistical shape models (SSMs) allow a single anatomy model to deform continuously to span the full range of shapes seen in a training set (a small dataset we can access which represents the real population), and also make it possible to determine how likely a given new shape is under the model (see the shape-model sketch after this list). These ideas are used routinely to help with surgical planning, and produce the gold standard in bone health assessment using finite element methods.
  • Unlimited licence: one aspect of using clinical data is that agreements are typically limited in scope and in time, and may also require certain certifications to be met by the company. Synthetic data does not require the same licensing strategy; you can use it for as many applications, and for as long, as you like.
  • Enabling new technology: synthetic data gives access to ground truth, that is, the totality of the data available for that particular synthetic patient. Certain applications are impossible to attempt with clinical data: for example, a normal 2D radiograph does not come with information about the 3D morphology of the objects in the scan (bone volume, for example). With synthetic data that information is available, and that in itself unlocks a new world of possible applications.
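
To illustrate the data-volume and annotation points above, here is a hedged sketch of a data-generator pipeline. Everything in it is a toy stand-in invented for the example (a random 2D “phantom” with a circular lesion); a real pipeline would call a phantom library and a physics simulator instead. The structure is what matters: an endless generator that yields images together with ground-truth labels the simulation knows by construction, at whatever rate the training loop consumes them.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_phantom():
    """Hypothetical stand-in for a phantom + pathology model."""
    image = rng.uniform(0.01, 0.03, size=(128, 128)).astype(np.float32)
    mask = np.zeros((128, 128), dtype=np.uint8)
    cx, cy = rng.integers(20, 108, size=2)
    r = rng.integers(5, 15)
    yy, xx = np.ogrid[:128, :128]
    lesion = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
    image[lesion] += 0.02        # the pathology changes the attenuation...
    mask[lesion] = 1             # ...and we know exactly where it is
    return image, mask

def synthetic_batches(batch_size=8):
    """Endless stream of (image, pixel-perfect label) pairs.

    Because the labels come straight from the simulation's ground truth,
    no expert annotation step is needed.
    """
    while True:
        pairs = [sample_phantom() for _ in range(batch_size)]
        images = np.stack([p[0] for p in pairs])
        masks = np.stack([p[1] for p in pairs])
        yield images, masks

# Toy consumption loop: a real project would feed these batches into a
# segmentation or detection model instead of just printing shapes.
for step, (images, masks) in zip(range(3), synthetic_batches()):
    print(step, images.shape, masks.shape, int(masks.sum()))
```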
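
And the shape-model sketch, for the population-variation point: a toy PCA-based statistical shape model built on invented landmark data, not a real anatomical model. The mean shape plus a handful of principal modes capture the variation seen in a small training set; new plausible anatomies are drawn by sampling the mode weights, and the same model scores how typical any given shape is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: 50 shapes, each described by 100 (x, y) landmarks,
# flattened to vectors of length 200. In practice these would come from
# a small set of real, registered anatomies.
n_shapes, n_coords = 50, 200
training_shapes = rng.normal(size=(n_shapes, n_coords))

# Build the statistical shape model: mean shape + principal modes.
mean_shape = training_shapes.mean(axis=0)
centred = training_shapes - mean_shape
u, s, vt = np.linalg.svd(centred, full_matrices=False)
modes = vt                               # principal modes of variation
stddevs = s / np.sqrt(n_shapes - 1)      # per-mode standard deviations

# Generate a new, plausible anatomy by sampling mode weights within the
# range observed in training (here clipped to +/- 3 standard deviations).
n_modes = 10
b = rng.normal(scale=stddevs[:n_modes])
b = np.clip(b, -3 * stddevs[:n_modes], 3 * stddevs[:n_modes])
new_shape = mean_shape + b @ modes[:n_modes]

# The same model scores how "typical" any shape is: project onto the modes
# and measure the Mahalanobis distance of the weights.
weights = (new_shape - mean_shape) @ modes[:n_modes].T
mahalanobis = np.sqrt(np.sum((weights / stddevs[:n_modes]) ** 2))
print(new_shape.shape, round(float(mahalanobis), 2))
```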

The AI Perspective

This article wouldn’t be complete without a mention of artificial intelligence, but look how well the idea of synthetic data plays into the hands of AI practitioners. AI lives and dies by the quality and quantity of its data, and through synthetic data we can increase both.

AI has had some run-ins with issues like racial bias in the past; this type of problem can be tackled directly with simulated data. If there are under-represented demographics in the real data population, we can enhance their representation in the synthetic data.
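
As a minimal, hedged sketch of that rebalancing idea (the group names and proportions are invented): whatever the demographic mix of the available real data, the synthetic generation plan can simply sample each new patient from the distribution we actually want, and then condition the simulation on the sampled group.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented, purely illustrative numbers: the demographic mix of an available
# real reference dataset versus the mix we want the training data to have.
observed_fraction = {"group_a": 0.70, "group_b": 0.25, "group_c": 0.05}
target_fraction = {"group_a": 0.34, "group_b": 0.33, "group_c": 0.33}

groups = list(target_fraction)
probs = np.array([target_fraction[g] for g in groups])

# With synthetic data the fix is direct: draw each new synthetic patient's
# demographic group from the *target* distribution, then condition the
# phantom and simulation parameters on that group.
n_patients = 10_000
sampled = rng.choice(groups, size=n_patients, p=probs)

counts = {g: int((sampled == g).sum()) for g in groups}
print("real-data mix: ", observed_fraction)
print("synthetic mix: ", {g: round(counts[g] / n_patients, 3) for g in groups})
```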

The simulations give full control over the imaging conditions and the patients within the data used to train AI. Not only does this allow control over what the algorithm learns from, it also allows controlled excursion tests on the AI, to detect biases and to establish the limits of its performance; in short, to validate the algorithm. This goes some way towards solving the problem of explainability (or the lack of it) in AI, which is considered a significant barrier to its adoption in healthcare.
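
A sketch of what a controlled excursion test could look like: because the simulation fixes every imaging and patient parameter, you can vary exactly one factor (here, a relative dose level) and watch how sensitivity responds. Both the simulator and the “trained model” below are fake stand-ins so the sweep runs end to end; in a real study the score would come from the actual algorithm under test applied to fully simulated images.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_case(dose, has_lesion):
    """Hypothetical stand-in for simulator + trained model: returns the
    score the model would produce for a case imaged at this dose. The
    'model' here is faked (signal plus dose-dependent noise)."""
    noise = rng.normal(scale=1.0 / np.sqrt(dose))
    return (1.0 if has_lesion else 0.0) + noise

def sensitivity_at_dose(dose, n_cases=2000, threshold=0.5):
    """Fraction of simulated lesion cases the 'model' flags at this dose."""
    scores = np.array([simulate_case(dose, True) for _ in range(n_cases)])
    return float((scores > threshold).mean())

# Controlled excursion: vary only the simulated dose, keep every other
# imaging and patient parameter fixed, and watch where performance degrades.
for dose in [0.25, 0.5, 1.0, 2.0, 4.0]:
    print(f"relative dose {dose:>4}: sensitivity = {sensitivity_at_dose(dose):.3f}")
```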

Wrap up

So picture this… 

You have a new idea for a technique or product to improve patient outcomes in a particular medical imaging application. You need some data to test this idea, but you have neither several years nor several million to spend testing your hypothesis.

You should use synthetic data.

You can pick whatever body part you want, specify the exact population demographic you want to look at, and produce both healthy and diseased examples. You can pick your imaging system from any available across the globe, and you can have a huge dataset in a matter of hours to train your algorithms and, later, clinically validate them.

The “age of big data” might be driven not only by those who master the collection and management of data, but also by those who can create it.

Summary and Conclusion

Medical imaging, like many fields, is increasingly data-driven and is making amazing strides in AI development. Not many areas of society generate quite as much data as the healthcare system.

But at present there are many issues with actually accessing that gigantic volume of data, and innovation could be throttled by the barrier to data access, particularly for startups and SMEs in the commercial sector.

The problem is being worked on from within the health service, but who knows how long it will take to solve, particularly if it requires a shift in public perception.

So why not just simulate the whole thing? The technology to do so exists; it has been shown to create data that is indistinguishable from the real thing, and it plays perfectly into developing technologies like GPU hardware and AI algorithms. There are so many advantages to simulating medical imaging data that these techniques might ultimately deliver higher quality datasets, and higher quality technology built on them.

I think simulation for clinical data is the future.

References

[1] EMRAD
[2] OPTIMAM, paper. This database is one of the few examples of a dataset which has been successfully commercialised and serves as a great template for further work in areas outside of mammography.
[3] UK Govt. Industrial Strategy; NHSX; White paper: NHSX, A Buyer’s Guide to AI in Health and Care
[4] See for example, MURA, SICAS SMIR, TCIA
[5] Advances in Computational Human Phantoms and Their Applications in Biomedical Engineering – A Topical Review
[6] Evaluation of Digital Breast Tomosynthesis as Replacement of Full-Field Digital Mammography Using an In Silico Imaging Trial.
[7] The VICTRE project, or a nice AI example.