Artificial Intelligence in the Life Sciences
Why AI models often fail in practice
Why did you see the need for the guideline you published with researchers from FAU Erlangen, the Helmholtz Institute for Pharmaceutical Research Saarland, and Saarland University?
Dominik Grimm: There is a lot of activity in this area, which is good because many questions can no longer be answered with purely human analytical capabilities. At the same time, there is a discrepancy between the results obtained in studies and those obtained in real-world applications. Results are often not reproducible. This poses a significant risk, for example, when these models are used in clinical diagnostics.
Markus List: Many publications present models with very high predictive accuracy. This creates a false sense of security, as the model initially appears to solve the required task reliably. However, it is often impossible to understand how the model arrived at its predictions. Methodological problems in machine learning and hidden data dependencies can lead to unrealistically high accuracy. The latter can only be identified with expertise in both machine learning and the life sciences. Therefore, we advocate for more collaboration between the disciplines to combine their competencies. This way, problems caused by hidden dependencies can be identified.
What do you mean by hidden dependencies?
List: Often, data from a single study is used to develop models. It is rarely tested whether models also work in practice with data collected in a different location or with other measuring devices. For example, imagine that we create a dataset describing the microbiome of 500 people from Munich. We split this data and use 400 samples as training data for the model. We initially hold back 100 samples to measure how well the model applies to unseen data—these are our test data. The model then learns to recognize patterns present at the molecular level in patients living in Munich. It works very well with the 100 held-back samples—the test data. However, when applied to people in Hamburg, the results suddenly differ. One cause could be hidden dependencies, such as people living in Munich having a different microbiome than the population of Hamburg.
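The scenario described above can be sketched numerically. The cohort sizes, the nearest-centroid classifier, and the site-specific offset below are illustrative assumptions, not the setup of an actual study:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cohort(n, site_shift=0.0):
    """Synthetic cohort: two classes; `site_shift` models a hidden site effect."""
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(loc=y[:, None] * 2.0 + site_shift, scale=1.0, size=(n, 2))
    idx = rng.permutation(n)
    return X[idx], y[idx]

def fit_centroids(X, y):
    # One mean vector per class — a deliberately simple classifier
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, X, y):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.argmin(axis=1) == y).mean()

# 500 "Munich" samples: 400 for training, 100 held back as test data
X, y = make_cohort(500)
centroids = fit_centroids(X[:400], y[:400])

acc_munich = accuracy(centroids, X[400:], y[400:])  # high: same distribution
# "Hamburg" cohort with a hidden, site-specific shift in the measurements
X_h, y_h = make_cohort(200, site_shift=3.0)
acc_hamburg = accuracy(centroids, X_h, y_h)         # drops sharply
```

The held-back Munich test data looks reassuring precisely because it shares the hidden site effect with the training data; only the shifted cohort reveals the problem.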
A problem also arises when the model is trained with information that is unavailable later. For example, if you want the model to predict whether someone will develop high blood pressure, you use clinical data from people with high blood pressure as training data. The model then looks for indicators of high blood pressure and finds that patients take antihypertensive drugs. However, if you use it for a person with undiagnosed high blood pressure, you will not see this feature in the clinical data because the person is not yet taking medication.
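This kind of leakage can be made concrete in a few lines. The variables below — a medication flag that mirrors the diagnosis and a weakly informative BMI value — are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
hypertension = rng.integers(0, 2, n)            # true outcome to predict

# In retrospective clinical data, diagnosed patients are already medicated,
# so this feature mirrors the label — a leaked, post-diagnosis signal:
on_medication = hypertension.copy()
train_acc = (on_medication == hypertension).mean()       # looks perfect: 1.0

# At screening time, nobody has been diagnosed or medicated yet,
# so a model relying on the flag predicts "healthy" for everyone:
screening_pred = np.zeros(n, dtype=int)
missed = (screening_pred[hypertension == 1] == 0).mean() # every case missed: 1.0

# A legitimate (if weak) feature behaves consistently in both settings:
bmi = rng.normal(25 + 3 * hypertension, 4)
bmi_acc = ((bmi > 26.5).astype(int) == hypertension).mean()
```

The leaked feature yields flawless numbers during development and fails completely at deployment, while the honest feature gives the same modest performance in both settings.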
So parts of the training data end up in the test data, but they shouldn’t be there?
Grimm: Yes, that’s correct. We call this data leakage: the illicit spillover of information from the training data into the test data. It often stems from hidden correlations with measurements that are irrelevant or misleading in the actual application. Our guidelines aim to raise awareness of this problem and, more importantly, to improve the understanding of data and applications. This way, hidden dependencies can be identified early and data leakage avoided when developing and training new models.
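One common, easy-to-miss form of such spillover is fitting a preprocessing step on the full dataset before splitting. A minimal sketch with z-score normalization (the data and the 400/100 split are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=(500, 3))

# Leaky: statistics computed on ALL samples let test-set information
# influence how the training data is transformed.
mu_leaky, sd_leaky = X.mean(axis=0), X.std(axis=0)

# Clean: split first, fit the normalization on the training portion only,
# then apply those frozen statistics to the held-out test portion.
X_train, X_test = X[:400], X[400:]
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train_n = (X_train - mu) / sd
X_test_n = (X_test - mu) / sd
```

Note that the test portion is then not exactly standardized — and must not be: it has to be treated exactly like future, unseen data.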
List: Ultimately, it’s a matter of carefully considering the application for which the models are being developed. When training, you must ensure that you have the appropriate data for the specific application. However, independent data is often not available for testing. To train robust models successfully, we must design them so that they do not take shortcuts or incorporate biases.
Can you briefly explain what you mean by that?
List: Models are often trained on data that represents certain aspects one-sidedly. In the previous microbiome example, the geographical component was not sufficiently considered. In practice, we often encounter the problem that well-researched diseases are overrepresented in databases compared to those about which little is known. Such biases can lead to incorrect model predictions.
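The effect of such overrepresentation on headline accuracy can be shown with a deliberately trivial example (the 95/5 split is assumed purely for illustration):

```python
import numpy as np

# 95 samples of a well-researched disease (class 0), 5 of a poorly covered one (class 1)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)   # a "model" that only ever predicts the majority class

accuracy = (y_pred == y_true).mean()      # 0.95 — sounds impressive
recall_rare = y_pred[y_true == 1].mean()  # 0.0  — the rare disease is never found
```

A single accuracy figure hides that the underrepresented class is missed entirely; per-class metrics expose the bias.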
And what happens if these problems are not addressed?
Grimm: Data collected over decades of research is stored in databases and can be used for subsequent research projects. If errors creep in, they perpetuate themselves in subsequent studies. Ultimately, this could affect medical treatment and, in the worst case, even jeopardize patient safety.
List: This problem is exacerbated as we collect more data and the methods become more complex. With simple models, it is still possible to understand how a result comes about. With highly complex neural networks, this eventually becomes impossible. We must break open the black box, critically examine possible biases, and test models for practical applicability. Many researchers are also developing new methods that allow us to look into the black box and understand decision-making processes.
Grimm: Researchers need to understand the complexity of the data and dependencies, and what they are feeding the algorithms. They also need to be clear about the questions they want the models to answer. Used wisely, models can help us narrow down search spaces and find clues to solutions. It is now essential to steer the work with the models in the right direction to achieve this.
Bernett, J., Blumenthal, D. B. et al.: Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods 21 (2024). doi.org/10.1038/s41592-024-02362-y
- The Professorship of Data Science in Systems Biology is part of the TUM School of Life Sciences, the Professorship of Bioinformatics is part of the TUM Campus Straubing.
- Prof. Dr. Markus List is a core member of the Munich Data Science Institute (MDSI).
- Research on Artificial Intelligence at TUM
Technical University of Munich
Corporate Communications Center
- Anja Lapac
- presse@tum.de
Contacts to this article:
Prof. Dr. Markus List
Technical University of Munich
Professorship of Data Science in Systems Biology
Tel.: +49 8161-71-2761
markus.list@tum.de
Prof. Dr. Dominik Grimm
University of Applied Sciences Weihenstephan-Triesdorf & Technical University of Munich
Professorship of Bioinformatics
Tel.: +49 9421-187-230
dominik.grimm@hswt.de