Meet the DDSA Fellows

Meet the visionary and creative-thinking PhD and Postdoc Fellows who have received a fellowship from the Danish Data Science Academy. In the sections below, you can learn more about the fellows and their research projects.

PhD Fellows 2024

Søren Vad Iversen

Position: Improved sampling for Bayesian Inference

Categories: Fellows, PhD Fellows 2024

Location: University of Southern Denmark

Abstract:

The aim of this project is the development of improved sampling algorithms for Bayesian inference. To achieve this, we build upon ideas from earlier work to create an efficient and robust sampling scheme that works well on both high-dimensional and multimodal distributions. As an initial application, the project will develop a Bayesian model selection tool with an integrated goodness-of-fit test that reveals how one might improve upon the model to describe the data even better. The goodness-of-fit test will build upon an existing test, but will use trigonometric functions as an alternative extension to the exponential family used in that test.
The hope is that this leads to new avenues of application for Bayesian inference, thanks to a sampling algorithm that is robust and efficient in a wider range of scenarios, such as the improved Bayesian model selection scheme of this project, making the true potential of Bayesian inference available for the problems at hand.

Anders Gjølbye Madsen

Position: Causal Approach to Trustworthy Artificial Intelligence in Healthcare

Categories: Fellows, PhD Fellows 2024

Location: Technical University of Denmark

Abstract:

The proposed PhD project aims to enhance the trustworthiness and interpretability of machine learning models in healthcare by leveraging causal approaches. It addresses the need for models with human-interpretable, causal handles to ensure AI systems are reliable and understandable to professionals. The project proposes to investigate and develop methods to distinguish between predictive and explanatory features in ML models, emphasizing the importance of causality in model interpretability.

Methodologically, the research will explore both the enhancement of existing models and the refinement of post hoc interpretation analysis. A notable direction involves extending the approach presented by Haufe et al. (2014) for linear models to the nonlinear context of deep learning, specifically targeting the discrepancy between predictive accuracy and meaningful explanations. Moreover, the project proposes to build upon established explainable AI (XAI) methods, such as LIME, incorporating Haufe et al.’s approach to improve LIME’s ability for causal explanations. An initial practical step will also revisit the applicability of the TCAV method to neuroimaging foundation models, aiming to further validate the use of human-interpretable concepts in a Concept Bottleneck Model setting.
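The linear-model result from Haufe et al. (2014) that this direction starts from can be illustrated on toy data. The sketch below is a minimal example under stated assumptions, not the project's method: for a single linear output s_hat = X·w, it computes the activation pattern a_j = Cov(x_j, s_hat) / Var(s_hat). The function name `activation_pattern` and the two-channel data (one channel carrying signal plus a distractor, one carrying only the distractor) are hypothetical.

```python
import random

def activation_pattern(X, w):
    """Pattern for one linear output s_hat = X @ w:
    a_j = Cov(x_j, s_hat) / Var(s_hat)."""
    n = len(X)
    s_hat = [sum(xj * wj for xj, wj in zip(row, w)) for row in X]
    s_mean = sum(s_hat) / n
    var_s = sum((s - s_mean) ** 2 for s in s_hat) / (n - 1)
    pattern = []
    for j in range(len(w)):
        col_mean = sum(row[j] for row in X) / n
        cov = sum((row[j] - col_mean) * (s - s_mean)
                  for row, s in zip(X, s_hat)) / (n - 1)
        pattern.append(cov / var_s)
    return pattern

# Toy data (hypothetical): channel 0 = signal + distractor, channel 1 = distractor only.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(5000)]
distractor = [rng.gauss(0, 1) for _ in range(5000)]
X = [[s + d, d] for s, d in zip(signal, distractor)]

# These weights recover the signal exactly, yet put a large weight on the
# distractor-only channel, which they use purely to cancel noise.
w = [1.0, -1.0]
pattern = activation_pattern(X, w)
```

In this toy setting the weight on channel 1 is as large as the weight on channel 0, while the pattern value for channel 1 is close to zero. That gap between what a model uses for prediction and what actually carries the signal is the discrepancy the paragraph above refers to.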

The significance of this research lies in its potential to close the trust gap in AI applications within healthcare. By ensuring that AI systems are both reliable and understandable, the project supports the goal of making AI in healthcare settings a valuable tool for professionals, ultimately leading to better patient outcomes. The theoretical advancements anticipated from this research will provide a foundation for future exploration into learning causal concepts for foundation models, with a long-term vision of facilitating personalized care through inherently causal deep learning models.

Data for this project will be sourced from an exclusive agreement with BrainCapture, providing annotated EEG scans from clinical studies, alongside publicly available large-scale EEG datasets and institutional resources. This rich data foundation will support the empirical aspects of the research, allowing for the validation and refinement of proposed methodologies in real-world scenarios.

Niklas Gesmar Madsen

Position: Perturbing the Rhythm; Interventions on Enzyme Dynamics with Geometric Graphs.

Categories: Fellows, PhD Fellows 2024

Location: Technical University of Denmark

Abstract:

Enzymes lie at the centre of a biosustainable and circular bioeconomy, but nature’s catalysts have a recursive complexity to them, often hidden in their dynamics. With the advances of genomics, sequence data is readily used to model and improve enzyme function. Similarly, advances in structural biology and geometric deep learning spark structurally driven enzyme engineering. Yet, the lasting frontier is to understand and design enzyme dynamics (i.e., the temporal dance), which is tightly correlated with enzyme functions (turnover, allostery, selectivity). Designing enzyme function via dynamics is critical for a wide range of enzyme-based applications in biotechnology.

State-of-the-art computational methods have found graph-representations of dynamics that correlate strongly with the location of mutations found in directed evolution studies. But how do mutations affect enzyme dynamics and function?

The project addresses this central question in three stages: (1) elucidate the scaling laws of dynamic representations, characterising how and when they can be obtained efficiently, (2) train a geometric graph neural network to predict dynamic representations directly from a static structure, and (3) develop a graph intervention algorithm which can account for the effect of a mutation on the dynamic representation.

The aim is to develop a methodology that can modulate enzyme dynamics and thus control function, including activity. The efficacy of the methods will be experimentally demonstrated on an enzyme system that is crucial for drug development and the biosynthesis of a wide range of natural products. We will combine high-throughput data generation with extensive molecular dynamics simulations as well as graph perturbations and machine learning, thus working at the intersection between data and computer science, bioinformatics, enzyme engineering, and computational chemistry. These methods aim to accelerate rhythmically informed enzyme engineering.

Kasper Fyhn Borg

Position: Collective causal reasoning: extracting and modeling networks of causal relations in Danish and English

Categories: Fellows, PhD Fellows 2024

Location: Aarhus University

Abstract:

The early 2020s have been defined by the concurrent global crises of the pandemic and climate change, characterized by complex interplays of causes, effects, and (real and potential) interventions. Communication about these crises reflects rich causal and counterfactual reasoning over collectively negotiated models of the world. Presently, the argumentative structure of collective discourse can only be studied qualitatively, which imposes limits on the generalizability and scalability of research findings, largely because the task of Causal Relation Extraction (CRE) at scale is underdeveloped in NLP, and non-existent for low-resource languages like Danish.

The project leverages state-of-the-art large language models (LLMs) and few-shot prompt-based training to implement a ground-breaking computationally-assisted approach to CRE at scale: modeling collectively constructed causal models via causal linguistic reports in texts. It represents the first NLP implementation of causal modeling at scale and is developed with multilingual support for both English and Danish. By developing methods to automate the extraction of collective causal models from corpora and produce interpretable graphs of their underlying structure, we allow causal relations to be investigated empirically and at the scale of public discourse.

Causal language is a window into how humans reason causally and counterfactually, a capacity widely held to be the hallmark of human intelligence, and a key topic in research on science and crisis communication, mis/dis-information, and public trust and solidarity. Unlike many computational methodologies, our models and tools will be developed and fine-tuned through application to social scientific research questions. This integrated and research-guided approach ensures that model performance will be evaluated for explainability, interpretability, and robustness by domain-experts at every step. Our open source models and published results will have broad applicability for researchers across disciplines, as well as external stakeholders like policymakers and public health officials.

Johanna Düngler

Position: Understanding the interaction of Privacy, Robustness, and Fairness in machine learning algorithms

Categories: Fellows, PhD Fellows 2024

Location: University of Copenhagen

Abstract:

Machine learning models and algorithms are increasingly integrated into critical decision-making processes across various domains such as healthcare, finance, criminal justice, and more. The decisions made by these models can have profound consequences on individuals and society. Due to this, Trustworthiness is a critical facet of contemporary Machine Learning research and applications. This project aims to focus on the notions of Robustness, Privacy, and Fairness within ML systems, exploring the trade-offs and dependencies among them.

Past works have shown an intricate relationship between those notions, and balancing them often involves making compromises, as optimising one aspect may come at the expense of another. Starting from Cummings et al. [1], several works [2,3] were able to show there is a potentially huge cost in accuracy when trying to achieve fairness and privacy simultaneously. Further, balancing between different notions of fairness involves making decisions that optimise for one aspect while accepting trade-offs in the other [4]. With regard to robustness, a theoretical equivalence between robustness and privacy [5,6] was shown. However, this equivalence remains theoretical, as it is not computationally feasible to implement in practice.

The first part of the project seeks to identify conditions on data and algorithms that enable the simultaneous achievement of privacy, robustness, and fairness. Additionally, we aim to quantify the minimum sacrifice in utility required for these notions to coexist harmoniously and design practical machine learning algorithms that successfully achieve this trade-off. Lastly, we are investigating whether relaxing any of these trustworthiness criteria can facilitate their coexistence with a reduced cost in utility. This project will be co-supervised by Amartya Sanyal and Rasmus Pagh, both at the Department of Computer Science of the University of Copenhagen.

The results of this project can significantly impact how we use machine learning algorithms in real-world settings. By understanding the limits of trustworthiness and quantifying the costs in terms of accuracy, vulnerability, and model training time incurred to enhance their trustworthiness, valuable insights can be provided to stakeholders. This includes policy makers, healthcare experts, and the public, allowing them to make informed choices when deploying ML systems, balancing the need for robustness with fairness, privacy, and utility considerations.

Martin Sæbye Carøe

Position: Multi-modality reconstruction methods for robust mineral classification in the “Black Beauty” Mars meteorite using neutron and X-ray tomography

Categories: Fellows, PhD Fellows 2024

Location: Technical University of Denmark

Abstract:

Computed tomography is a very widespread approach to visualize multi-phase materials in 3D. This project will explore novel mathematical methods that combine several modalities, in particular X-ray CT (XCT) and Neutron CT (NCT) data to improve the resulting reconstruction and subsequent classification of a sample. Currently, reconstruction methods that treat each data modality separately are being used, from which the reconstructions are then combined into a single segmentation. This does not take into account the different nature of the data. In the project, we will study variational regularization methods that integrate prior information and use statistical models for the measurement errors. We will also use a so-called material decomposition method that expands the data in terms of a basis function for each material.

In order to demonstrate the methods, the project will team up with an existing collaboration between the 3D Imaging Center at DTU, NBI and the planetary science group at GLOBE, KU. These groups study the Mars meteorite NWA 7034, “Black Beauty” with the purpose of gaining understanding about planetary formation and habitability. With the newly developed methods, the hope is that for the first time we will be able to segment the grains of this large meteorite into distinct minerals. This type of non-destructive characterization is vital, as the meteorite is one-of-a-kind. If successful, this type of analysis is relevant for characterization of the samples returned to Earth from the Mars Perseverance mission in 2030, and as such the DDSA funded science may have an impact for ESA.

Mathilde Diekema

Position: Enhancing Circulating Tumor DNA Detection in Whole Genome Sequencing Data through Deep Generative Modelling

Categories: Fellows, PhD Fellows 2024

Location: Aarhus University

Abstract:

Cell-free DNA (cfDNA) in the bloodstream, which includes circulating tumor DNA (ctDNA) from tumors, is a promising biomarker for cancer detection and monitoring. However, the minute amounts of ctDNA, especially in early-stage cancers or small tumors, make detection difficult, as the signal is often indistinguishable from sequencing errors. This PhD project proposes to introduce deep generative models to the ctDNA field to facilitate analysis and improve detection performance. Building on the DREAMS (Deep Read-level Modelling of Sequencing-errors) method, I will integrate a Deep Generative Decoder (DGD) to model both noise and cancer-related signals. Through latent-space sample representations, the DGD model is hypothesised to better capture the complex, patient-specific ctDNA signals and sample-specific noise patterns in whole genome sequencing (WGS) data. This project is supported by extensive, local WGS cfDNA datasets from colorectal cancer (CRC) patients, bladder cancer (BC) patients and healthy controls, offering a unique opportunity to refine and validate the proposed methods. It encompasses three primary aims: developing a DGD of sample-specific cfDNA error rates, quantifying ctDNA fractions using latent space modelling, and creating a multimodal ctDNA detection framework. These efforts promise to enhance the sensitivity and specificity of ctDNA detection, facilitating earlier cancer detection, improved tumor burden monitoring, and personalised treatment strategies.

Sijia Feng

Position: Physics-informed machine learning with multi-source satellite data to quantify field-scale N2O emissions for climate-smart agriculture

Categories: Fellows, PhD Fellows 2024

Location: Aarhus University

Abstract:

Nitrous oxide (N2O) is a highly important greenhouse gas, and its anthropogenic sources are primarily fertilized croplands. The spatiotemporal patterns of N2O emissions are characterized by hot-spot and hot-moment effects, which are controlled by hydrological processes, microbial C and N substrate availability, and soil microbial activity. These complex processes result in large uncertainties in quantifying the spatiotemporal variability of field-scale N2O emissions. Therefore, cost-effective and accurate estimates of field-scale N2O fluxes are highly needed to reduce the climate footprint of agriculture and to mitigate climate change. To achieve this goal, we propose to use a Physics-Informed Machine Learning (PIML) framework to integrate process-based ecosystem modeling and multi-source remote sensing data to quantify field-scale N2O emissions for the EU wheat cropping systems. Specifically, there are three tasks for this study. (1) We will use PIML to retrieve soil moisture from multi-source remote sensing data along with field measurements to support the identification of hot moments for N2O emissions. (2) We will adapt the computer vision foundation model, Segment Anything Model, to detect ponding areas in high-resolution satellite imagery to identify hot spots of N2O emissions. (3) We will develop another PIML model based on the biogeochemical LandscapeDNDC model, which has been extensively tested for its capability to predict N2O emissions from field to continental scales. This model will integrate Task 1 soil moisture, Task 2 ponding areas, and N2O measurements to quantify field-scale N2O emissions across the EU wheat cropping systems from 2020 to 2026. Furthermore, scenario analysis will be conducted to assess the impact of fertilizer rates and cover cropping on N2O emissions to develop mitigation strategies.

This interdisciplinary project will bring significant scientific and socioeconomic impacts. Through PIML, this project will be the first to explicitly consider ponding effects in quantifying field-level N2O emissions, which can improve our understanding of the driving factors for N2O emissions. In addition, it is highly aligned with Danish and EU policies on green transition and climate-smart agriculture. The expected outcomes can be used for voluntary carbon credit markets to bring economic benefits. Furthermore, this technology is also highly transferable to other regions for the sustainability of crop production and environmental health.

Christian Mikkelstrup

Position: Biomedical Image Segmentation using Graph Cuts

Categories: Fellows, PhD Fellows 2024

Location: Technical University of Denmark

Abstract:

I propose a project that combines Deep Learning (DL) and Graph Cuts for more accurate image segmentation methods of medical and biomedical 3D volumes. Current state-of-the-art segmentation research is dominated by Convolutional Neural Networks (CNNs) like U-net because they have high accuracy and are adaptable to almost any problem. However, they require large annotated 3D datasets and, since their segmentation is based on voxel-wise labeling, do not guarantee known topological features and object shapes. On the other hand, the Graph Cut is a classic segmentation method [16] based on graph theory and maximum a posteriori energy minimization. With graph-cut-based methods, we can model smoothness, surfaces, and object inclusion/exclusion, all without needing training data. The approach does, however, require spatial initialization and modeling of the image intensities to be used for the graph energies.
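To make the energy-minimization view concrete, here is a minimal sketch of a binary graph cut on a toy one-dimensional "image". It is an illustration under stated assumptions, not the project's algorithm: the function name `min_cut_labels`, the quadratic intensity model with assumed class means `mu_bg` and `mu_fg`, the `smoothness` penalty, and the use of a plain Edmonds-Karp max-flow are all choices made here for brevity.

```python
from collections import deque

def min_cut_labels(pixels, mu_bg, mu_fg, smoothness):
    """Label each pixel 0 (background) or 1 (foreground) by minimising
    sum_i D_i(label_i) + smoothness * (number of neighbouring label changes)."""
    n = len(pixels)
    source, sink = n, n + 1
    # capacity[u][v] is the residual capacity of edge u -> v.
    capacity = [dict() for _ in range(n + 2)]

    def add_edge(u, v, cap_uv, cap_vu=0):
        capacity[u][v] = capacity[u].get(v, 0) + cap_uv
        capacity[v][u] = capacity[v].get(u, 0) + cap_vu

    for i, x in enumerate(pixels):
        add_edge(source, i, (x - mu_bg) ** 2)  # cut (paid) if pixel i is background
        add_edge(i, sink, (x - mu_fg) ** 2)    # cut (paid) if pixel i is foreground
        if i + 1 < n:
            add_edge(i, i + 1, smoothness, smoothness)  # paid if neighbours disagree

    def reachable_from_source():
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in capacity[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    while True:
        parents = reachable_from_source()
        if sink not in parents:
            # Max flow reached: pixels still connected to the source are foreground.
            return [1 if i in parents else 0 for i in range(n)]
        # Walk back along the shortest augmenting path and push the bottleneck flow.
        path, v = [], sink
        while parents[v] is not None:
            path.append((parents[v], v))
            v = parents[v]
        bottleneck = min(capacity[u][v] for u, v in path)
        for u, v in path:
            capacity[u][v] -= bottleneck
            capacity[v][u] += bottleneck

# Toy intensities in percent; pixel 3 is a dim pixel inside a bright object.
pixels = [10, 0, 90, 40, 90, 100, 10, 0]
labels_smooth = min_cut_labels(pixels, mu_bg=0, mu_fg=100, smoothness=2000)
labels_unary = min_cut_labels(pixels, mu_bg=0, mu_fg=100, smoothness=0)
```

With the smoothness term switched off, the dim pixel is labelled by its intensity alone. With the smoothness term on, the cut that splits the object costs more than relabelling that pixel, so the object stays in one piece. The intensity model (`mu_bg`, `mu_fg`) is exactly the kind of initialization the paragraph above says graph cuts need.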

Combining DL-based segmentation with Graph Cuts has great potential as DL can solve the problem of graph initialization while graph segmentation can ensure the correct topological features. How to combine the methods optimally and efficiently is what I propose to investigate in my PhD. We will investigate hybrid methods for increased prediction robustness and a unified pipeline where the graph segmentation is part of the DL training, reducing the required training data. We also plan to take advantage of geometric shape priors by modeling them directly into the graph. By creating new algorithms for Graph Cuts, we expect faster algorithms that can additionally be used in an interactive segmentation setting.

To support me, I have a great cross-disciplinary team with Professors Inge Li Gørtz as principal supervisor and Anders B. Dahl as co-supervisor. Inge is an expert in combinatorial optimization for graphs and the analysis of algorithms. Anders is an expert in 3D image analysis with extensive experience in 3D image segmentation and applying segmentation to biomedical 3D image analysis problems. Through DTU Compute, I will also have access to the relevant resources, both in terms of compute and volumetric data, to enable the success of the project.

The research towards faster and more robust methods will be shared through publicly available code, enabling this general method to be applied to related problems. Investigating this segmentation approach will also lay the foundation for further research in this interdisciplinary field.

Postdoc Fellows 2024

Benjamin Skov Kaas-Hansen

Position: BECAUSE-ICU: Better Trial Enrichment with Causal Evidence from Intensive Care Unit data

Categories: Fellows, Postdoc Fellows 2024

Location: Rigshospitalet

Purpose:
BECAUSE-ICU will build the first iteration of a data warehouse with large-scale, real-world data from Danish intensive care units, in an OMOP common data model. Then, we will use the data warehouse to replicate results of previous clinical trials, and predict the results of ongoing or imminent trials.

Methods:

BECAUSE-ICU will continue current work on a proof-of-concept extract-transform-load pipeline that transforms complex source data from machines and manual registrations, stored in tens of thousands of files, into an OMOP common data model. The data warehouse will live on a secure high-performance computing infrastructure and follow data engineering best practices, including version control and logged access control. Code and software will be containerised and shared under lenient licenses.

Real-world causal evidence will be generated with a variety of conventional and machine learning methods for causal inference, using both frameworks built specifically for OMOP’ed data and general-purpose frameworks. We will build undirected graphs of head-to-head comparisons of interventions and exploit the network meta-analytic framework to synthesise one effect-size estimate for each exposure-outcome pair and scrutinise the results by comparing direct with indirect evidence.
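The comparison of direct with indirect evidence can be sketched for the simplest network, a triangle of interventions A, B and C. The example below is a hedged illustration using the standard adjusted indirect comparison and inverse-variance pooling, not the project's pipeline. It assumes effect sizes on an additive scale (for example log odds ratios), and all function names and numbers are hypothetical.

```python
import math

def indirect_estimate(d_ab, var_ab, d_bc, var_bc):
    """Indirect A-vs-C effect through the common comparator B.
    Effects add on the additive scale, and so do their variances."""
    return d_ab + d_bc, var_ab + var_bc

def pool_inverse_variance(estimates):
    """Combine (effect, variance) pairs with inverse-variance weights."""
    weights = [1.0 / var for _, var in estimates]
    pooled = sum(w * d for w, (d, _) in zip(weights, estimates)) / sum(weights)
    return pooled, 1.0 / sum(weights)

def inconsistency_z(direct, indirect):
    """z-statistic for the difference between direct and indirect evidence."""
    (d_dir, v_dir), (d_ind, v_ind) = direct, indirect
    return (d_dir - d_ind) / math.sqrt(v_dir + v_ind)

# Hypothetical (effect, variance) pairs on the log odds ratio scale.
direct_ac = (-0.30, 0.04)
indirect_ac = indirect_estimate(-0.10, 0.02, -0.25, 0.02)
pooled_effect, pooled_var = pool_inverse_variance([direct_ac, indirect_ac])
z = inconsistency_z(direct_ac, indirect_ac)
```

A large absolute value of `z` would flag disagreement between the two sources of evidence for that exposure-outcome pair, which is the kind of scrutiny the paragraph above describes.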

Significance:

BECAUSE-ICU is a true (clinical) data science project combining methods from DataOps, machine learning, data visualisation and clinical epidemiology.

BECAUSE-ICU builds the foundation for a lasting data infrastructure, built to scale and give faster answers to more questions using better data.

BECAUSE-ICU will alleviate challenges of current clinical trials in critical care by identifying performant enrichment and stratification schemes.

BECAUSE-ICU code and software will be open source so others can reuse and repurpose our work for their needs without locking themselves into ecosystems of specific vendors.

BECAUSE-ICU will (also) enable efficient build-up studies to identify and prioritise promising candidates for Intensive Care Platform Trial domains (see below) through fast cohort characterisation and causal inference estimation. This is important as only 10-15% of current critical interventions rest on high-level evidence.

Emil Michael Pedersen

Position: Developing a Multi-trait liability model for gene discovery in large biobanks

Categories: Fellows, Postdoc Fellows 2024

Location: National Centre for Register-based Research

Abstract:

Modern biomedical and health science research leans heavily on data science to extract meaning from large-scale biobanks like UK Biobank, iPSYCH, CHB/DBDS, etc. These biobanks offer large collections of genetic and phenotypic data and are an invaluable source of insight into the human genome, aetiology of complex disorders, and inspiration for methodological developments. My project aims to develop and apply a new method for estimating genetic liability from broad, complex register and biobank data. Using this (genetic) liability in genome-wide association studies (GWAS) should increase power for gene discovery and prediction.

We aim to develop a multi-trait extension of the age-dependent liability threshold model (ADuLT) from my dissertation work: multi-trait ADuLT (mADuLT). mADuLT will incorporate multiple genetically correlated phenotypes, age of onset, and family history, all at the same time, in the liability estimation. We will compare the extension to other state-of-the-art multi-trait GWAS methods in extensive simulations and develop selection criteria for when and how researchers can best utilise multiple phenotypes for gene discovery. I will apply mADuLT to several Danish biobanks and UK Biobank in a series of increasingly complex GWAS. We will explore mADuLT’s utility for single-trait analysis, e.g. MDD. Then we will explore mADuLT’s utility for detecting variants with shared effects, moving to disease domains, e.g. psychiatric disorders, and finally general health, e.g. all diseases. The results from each step will be scrutinised with post-GWAS analysis with FUMA, which assesses the plausibility of results via biological and functional information. Finally, we will train polygenic scores (PGS) on multi-trait GWAS to assess increases in predictive power and whether the correlation structure of multi-trait trained PGS is consistent with single-trait trained PGS.
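For readers unfamiliar with liability models, the classical single-trait liability threshold model that ADuLT and mADuLT extend can be sketched in a few lines. This is only the textbook starting point, assuming a standard normal liability and a known population prevalence. It does not include age of onset, family history or multiple traits, and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def liability_threshold(prevalence):
    """Threshold T such that P(liability > T) equals the prevalence."""
    return STD_NORMAL.inv_cdf(1.0 - prevalence)

def expected_liability(prevalence, is_case):
    """Mean liability given case/control status (truncated normal mean)."""
    t = liability_threshold(prevalence)
    density = STD_NORMAL.pdf(t)
    if is_case:
        return density / prevalence           # E[l | l > T]
    return -density / (1.0 - prevalence)      # E[l | l <= T]

# Hypothetical prevalence of 10%.
case_mean = expected_liability(0.10, is_case=True)
control_mean = expected_liability(0.10, is_case=False)
```

Replacing a binary case/control label with such an expected liability gives a continuous phenotype for GWAS. Refining that estimate with additional information is what the extension described above aims to do.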

My proposed method will allow researchers to extract even more information from modern, large-scale biobanks. My proposed applications will further our understanding of specific and shared factors in disease aetiology and could assist in identifying new drug targets or enhance predictive models for clinical application. We will develop an open source software package to support the continued free access to research and to ensure replicability.

Madeleine Wyburd

Position: Towards Early Detection of Structural Biomarkers of Cerebral Palsy

Categories: Fellows, Postdoc Fellows 2024

Location: University of Copenhagen

Abstract:

Cerebral Palsy (CP) is the most common motor disability in children, characterised by impaired movement, muscle coordination, and maintenance of posture. Early diagnosis of CP is essential, as early treatment and intervention can drastically improve the outcomes. A recent study has shown that the use of Magnetic Resonance Imaging (MRI) can help diagnose CP as early as 5 months; however, it requires specialist detection of brain malformation. Thus, an automated pipeline that can detect malformation from an MRI scan, e.g. by measuring specific tissues’ volumes and comparing them to the expected development, has the potential to aid the early detection of CP. However, while well-established neuroanalysis tools exist for adults, their application in infants (the period between 2 months and 2 years of age) is unfeasible because of the vast difference in brain size and the changes in the appearance of white and grey matter. Within this project, we aim to improve infant neuroimaging by developing a new state-of-the-art (SOTA) pipeline to quantify brain development between 3 months and 2 years, a period of vast change. The project is a collaborative effort with Lilla Zöllei, the developer of the current SOTA infant neuroimage analysis tool, which we aim to improve upon by using deep-learning algorithms.

Once we have developed an algorithm to analyse the infant brain, we will explore the differences between CP and expected brain development. To facilitate this investigation, I will use data collected as part of NIBS-CP and CP-EDIT, two large studies that are running in parallel to follow the development of up to 200 subjects, 50-75 of whom are at high risk of CP, with the remainder being healthy controls. Each recruit will have a series of longitudinal MRI scans at 3-9 months, 12 months and 24 months, paired with developmental outcomes. Thus, there is the potential to detect early CP biomarkers from 3 months. This rich data can then be used to build normative models of a typically developing Danish infant population and investigate whether we can identify predictive features of CP and motor deficits in the CP population. We hypothesise that accurately quantified brain structures can facilitate earlier diagnosis of CP and predict developmental outcomes, potentially leading to earlier intervention and thus better outcomes. Further, a normative model of typical Danish child development may promote the early diagnosis of other developmental disorders.

Samuel Matthiesen

Position: Towards Uncertainty- and Geometry-aware Generative Models

Categories: Fellows, Postdoc Fellows 2024

Location: Technical University of Denmark

Abstract:

Representation learning aims to transform data to extract relevant information for other tasks. This problem can be understood as being encompassed by generative modelling, which learns a mapping from a latent representation into a data point. As machine learning becomes more pervasive, it is critical to quantify confidence in the model behaviour in fields such as life sciences, security, and general decision making. In this project, we aim to address current limitations of modern approaches to fundamental problems of representation learning and generative modelling. We consider two related lines of research. The first aims to scale Bayesian inference to modern problems of generative modelling, enabling a principled approach to evaluate model behaviour with uncertainty estimates. The second is concerned with the geometry of the latent spaces of those models, allowing us to properly inspect and operate on them. We expect those models to be robust when put in scenarios different from those represented by the training data, and to allow sound analyses of high-dimensional, complex data to be conducted within their latent spaces. Primarily, we consider Gaussian process latent variable models (GP-LVMs) for both lines of research. These models are uniquely able to be employed in similar tasks to modern neural networks and, under certain conditions, have closed-form formulas for the expected metric induced in the latent space, an advantage over neural networks. As a starting point for research on scalability, we consider Laplace approximations that scale linearly with data size. A promising way to bring GP-LVMs to a modern setting could involve a linearised Laplace approximation of an autoencoder, which is based on a neural network, effectively transforming the generative part (decoder) into a GP-LVM. Furthermore, we intend to explore how to make use of the expected metric for GP-LVMs for larger problems. Their closed-form formulas can be inefficient to work with. 
Recent advances in modern automatic differentiation engines are a promising avenue for solving this, as the Jacobian of the model is usually needed for constructing the expected metric tensor. This requires careful reformulation of common operations on Riemannian manifolds. Together, the proposed lines of investigation aim to build scalable, uncertainty-aware generative models whose latent spaces are geometrically well-understood.
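
The role of the Jacobian mentioned above can be illustrated with a small sketch. It computes the deterministic pullback metric G = JᵀJ of a decoder, which is a simplified stand-in for the expected metric of a GP-LVM and not the project's actual method. The `decoder` function is a hypothetical toy, and finite differences are used here only so that the example needs no external library; an automatic differentiation engine would supply the Jacobian directly.

```python
def decoder(z):
    """Hypothetical toy decoder mapping a 2-D latent point to 3-D data space."""
    x, y = z
    return [x, y, x * x + y * y]


def jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian of f at z (rows: outputs, columns: latents)."""
    n_out = len(f(z))
    n_in = len(z)
    J = [[0.0] * n_in for _ in range(n_out)]
    for j in range(n_in):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        f_plus, f_minus = f(z_plus), f(z_minus)
        for i in range(n_out):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def pullback_metric(f, z):
    """Metric tensor G = J^T J induced in latent space by the decoder f."""
    J = jacobian(f, z)
    d = len(z)
    return [[sum(row[a] * row[b] for row in J) for b in range(d)] for a in range(d)]


# For this toy decoder the analytic metric at (1, 0) is [[5, 0], [0, 1]].
G = pullback_metric(decoder, [1.0, 0.0])
```

Distances and curve lengths in the latent space are then measured with G rather than with the Euclidean metric, which is what makes the latent geometry reflect the data space.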

Martin Rune Hassan Hansen

Position: Improved detection of diabetes mellitus among African adults using machine learning

Categories: Fellows, Postdoc Fellows 2024

Location: Steno Diabetes Center Aarhus

Purpose:

The objective of the project is to develop risk scores that can be used for detection of undiagnosed diabetes mellitus in the general adult population of Africa, using readily available information on e.g. gender, age, physical activity level, diet, blood pressure and weight status. We will develop a risk score suitable for self-administration, as well as risk scores suitable for administration by health workers.

Methods:

The outcome to be predicted is undiagnosed diabetes mellitus, defined as elevated fasting blood glucose (either fasting plasma glucose or fasting capillary blood glucose), with no self-reported previous diagnosis of diabetes mellitus, and no consumption of glucose-lowering drugs. We will use cross-sectional data from 43 population-wide surveys that were conducted as part of the World Health Organization (WHO) STEPS program. The surveys were conducted from 2003 to 2020 to monitor non-communicable diseases in 32 African countries and cover the entire continent. We have already received the data from the WHO, and no further data collection is necessary. A total of 125,538 individuals participated in the STEPS surveys and can be assessed for undiagnosed diabetes (fasting blood glucose measured, no pregnant women). We will create risk scores using both regression-based (Lasso) and tree-based models (decision tree, random forests), and validate them by k-fold cross-validation. Performance will be evaluated using measures of calibration (predicted vs. observed risk of diabetes mellitus) and discrimination (Area Under the Receiver Operating Characteristic Curve), and models will be assessed for algorithmic fairness. We will also compare the performance of the models with that of risk scores developed in other populations.
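
The two performance measures named above can be sketched as follows. This is a minimal illustration on made-up toy values, not the project's evaluation code: `auc` computes discrimination as the probability that a case receives a higher predicted risk than a non-case, and `calibration_table` compares mean predicted risk with observed risk in groups ordered by predicted risk. Both function names and the toy data are assumptions for illustration.

```python
def auc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison (ties count as 1/2)."""
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def calibration_table(y_true, y_prob, n_bins=2):
    """Return (mean predicted risk, observed risk) per group, ordered by predicted risk."""
    pairs = sorted(zip(y_prob, y_true))
    size = len(pairs) // n_bins
    rows = []
    for b in range(n_bins):
        start = b * size
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[start:end]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, observed))
    return rows


# Toy example: 1 = undiagnosed diabetes, 0 = no diabetes (hypothetical values).
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]
discrimination = auc(y_true, y_prob)
calibration = calibration_table(y_true, y_prob)
```

In a k-fold cross-validation, the same two functions would be applied to the held-out fold of each split and the results summarised across folds.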

Perspectives:

A validated risk score for diabetes mellitus has the potential to considerably improve the management of diabetes mellitus in Africa, as it will allow targeted screening of high-risk individuals, thus reducing the cost of case-finding. The STEPS surveys were conducted in the general population and the final risk scores will be suitable for administration in the same setting.

The risk scores will be implemented in community-based intervention programs against diabetes mellitus, coordinated by the East African NCD Alliance in Kenya, Rwanda, Burundi, Uganda and Tanzania, and will also be disseminated to African health ministries and other stakeholders.

Kazu Ghalamkari

Position: NTN-C3: Nonnegative Tensor Networks – Identifying the Limits of Convexity, Gains of Complexity, and Merits of Certainty

Categories: Fellows, Postdoc Fellows 2024

Location: Technical University of Denmark

Purpose:

The drastic spread of machine learning techniques has made it common to obtain scientific knowledge by analyzing data in various fields. Yet, the non-convexity and instability of many machine learning models lead to important reproducibility issues, with results differing when initializations change. Within data science and the modeling of biological data sets, non-negative tensor networks (NTN) have become a prominent tool providing parts-based and interpretable representations of large and complex datasets. However, NTN faces issues of non-convexity, leading to instability in applications. This project will introduce an enhanced framework for NTN that provides convexity, expressiveness, and robustness, aiming to mitigate the instability of downstream tasks across various domains reliant on tensors.

Method:

To eliminate the non-convexity of tensor decomposition, I recently developed a novel framework, many-body approximation (MBA) [1], that globally minimizes error. MBA regards tensor modes (axes) as particles and learns the interactions among them with appropriately designed energy functions. Although the convexity of MBA potentially eliminates instability in a variety of downstream tasks, MBA has so far been explored only in its “vanilla” form and is currently unsuited to solving specific tasks, and it remains unclear what makes MBA a convex optimization problem. To make MBA practical, I will discover the essence of convexity, improve complexity, and introduce certainty through noise robustness and prior distributions to the model. Specifically, I will advance MBA to:

• Accommodate symmetry, which plays a central role in link prediction in a graph [2], via a properly redesigned energy function, and include a new particle state that enables data fusion, where multiple tensors are analyzed at once.

• Expand its complexity by exploring the analogy between MBA and Boltzmann machines [3] and further introducing hidden variables and interference effects [4].

• Exploit prior distributions in the representation and perform noise-robust modeling without loss of convexity.

Significance of the Project:
This project develops an advanced tensor factorization that is convex, expressive, and robust. This can potentially lead to a paradigm shift towards solid data analysis to eliminate initial value dependency from downstream tasks of tensor decomposition and provide stable computational tools for the data science communities reliant on tensors.

Cross-Academy Fellows 2024

Jakob Nebeling Hedegaard

Position: Automatic Anomaly Discovery in Registry Data

Categories: Cross-Academy Fellows 2024, Fellows

Location: Aalborg University

Abstract:

Automatic Anomaly Detection in Health Registry Data by Dynamic, Unsupervised Time Series Clustering

Denmark has established a wealth of health registries used to monitor the quality of health care. Although this resource has enormous potential, data has become so complex and high-dimensional that important insights in quality of care and patients’ safety may go unnoticed. There is, thus, a need for a dynamic, automated algorithm capable of flagging growing anomalies in registry data, helping health care personnel to rapidly discover important divergencies.

In this project we will develop and test a new algorithm based on dynamic, unsupervised time series clustering with anomaly detection for health care data. At each time point, the algorithm will cluster patients (using, e.g., hierarchical, t-SNE, or autoencoder clustering) based on a patient trajectory metric (e.g., Hamming distance or optimal matching) and the development of anomaly clusters will be monitored by significant change in a cluster dissimilarity measure (e.g., Jaccard distance or MONIC). This algorithm’s output will consist of summaries of detected anomalies in a form that allows for a quick assessment by relevant health care professionals. These summaries will be evaluated by a team of experts, and the algorithm will be tuned based on their input. The algorithm will thus learn, through supervision, to predict expert interests. The algorithm will be developed and tested on the Danish Diabetes Database (DDiD).
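
Two of the measures named as examples above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's algorithm: patient trajectories are assumed to be equal-length sequences of state codes, clusters are assumed to be sets of patient identifiers, and the function `flag_changed_clusters` and its threshold of 0.5 are hypothetical.

```python
def hamming(a, b):
    """Number of time points at which two equal-length trajectories differ."""
    if len(a) != len(b):
        raise ValueError("trajectories must have the same length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(s, t):
    """1 - |intersection| / |union| for two sets of patient identifiers."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 0.0
    return 1.0 - len(s & t) / len(union)


def flag_changed_clusters(previous, current, threshold=0.5):
    """Flag current clusters whose closest previous cluster is farther than threshold."""
    flagged = []
    for name, members in current.items():
        closest = min(
            (jaccard_distance(members, old) for old in previous.values()),
            default=1.0,
        )
        if closest > threshold:
            flagged.append(name)
    return flagged


# Toy clusters at two consecutive time points (hypothetical patient identifiers).
previous = {"a": {1, 2, 3}, "b": {4, 5}}
current = {"a": {1, 2, 3, 6}, "c": {7, 8, 9}}
new_or_changed = flag_changed_clusters(previous, current)
```

Here cluster "a" has only gained one patient and stays below the threshold, while "c" has no counterpart at the previous time point and would be passed on for expert review.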

Such an algorithm would greatly improve the health care system’s ability to react in a timely manner to both positive and negative trends in quality of care. Furthermore, the algorithm will be developed in a disease-independent fashion, such that it can be implemented more generally and potentially be used to monitor other areas in critical need of attention.

Shanshan He

Position: Disentangling the Genetic Basis of Diabetic Kidney Disease Using Single-Cell Multimodal Sequencing in Human Diabetic Kidney

Categories: Cross-Academy Fellows 2024, Fellows

Location: University of Copenhagen

Abstract:

Diabetic kidney disease (DKD) represents a major long-term complication of Type 2 Diabetes (T2D), increasing the risk of kidney failure and cardiovascular events. Yet, the relationship between T2D and DKD is complex, as it is difficult to accurately predict the degree of kidney damage a T2D patient will develop and whether it will eventually develop into DKD. This is largely driven by a lack of understanding of the precise molecular and cellular mechanisms underlying the association between DKD and T2D. This project aims to deepen our understanding of the development of DKD in T2D on a genetic and cellular level through the application of a state-of-the-art single-cell multimodal sequencing assay, bioinformatics tools, and deep learning models. By simultaneously profiling gene expression and genome-wide chromatin accessibility within the same kidney nuclei, we will construct a comprehensive molecular atlas derived from thirty kidney biopsies representing a spectrum of severity from non-diabetic kidney disease to DKD in T2D patients from the PRIMETIME2 Danish national cohort study.

This atlas will facilitate the generation of cell type-specific gene regulation networks and the integration of regulatory DNA atlases with disease genetic variants obtained from high-powered genome-wide association study datasets. We will use this to calculate kidney cell type-specific polygenic risk scores (PRSs) to stratify large heterogeneous patient groups and validate the predictive power of these cell type-specific PRSs in several large deeply genotyped cohorts.
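
As a rough illustration of the idea above, a polygenic risk score is commonly computed as a weighted sum of risk-allele dosages, with weights taken from a genome-wide association study. The sketch below assumes, for illustration only, that a cell type-specific score is obtained by restricting the sum to variants lying in regulatory regions of that cell type; the variant identifiers, effect sizes, and the restriction rule are all hypothetical and not taken from the project.

```python
def polygenic_risk_score(dosages, effects, allowed=None):
    """Weighted sum of allele dosages; `allowed` optionally restricts the variant set."""
    score = 0.0
    for variant, beta in effects.items():
        if allowed is not None and variant not in allowed:
            continue
        score += beta * dosages.get(variant, 0)
    return score


# Hypothetical effect sizes per variant and one person's risk-allele counts (0, 1 or 2).
effects = {"variant_a": 0.2, "variant_b": -0.1, "variant_c": 0.5}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

# Hypothetical set of variants overlapping open chromatin in one kidney cell type.
cell_type_variants = {"variant_a", "variant_c"}

genome_wide_score = polygenic_risk_score(dosages, effects)
cell_type_score = polygenic_risk_score(dosages, effects, allowed=cell_type_variants)
```

Comparing how well the two scores stratify patients is one way the added value of the cell type-specific version could be assessed.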

Through this comprehensive analysis, we aim to gain novel insights into the shared genetic, cellular, and molecular basis of DKD and T2D. This understanding will enhance the prediction and precision treatment of DKD by stratifying the heterogeneous T2D patient group.

Malene Nørregaard Nielsen

Position: Illuminating the Potentials of Photoplethysmography in Patients with Atrial Fibrillation using Explainable Artificial Intelligence

Categories: Cross-Academy Fellows 2024, Fellows

Location: University of Copenhagen

Abstract:
Aim:
This project investigates the unexplored clinical potentials of photoplethysmography (PPG) as an assessment tool for patients with atrial fibrillation (AF). We aim to investigate (1) the impact of risk factors on PPG, (2) how AF-related hemodynamic changes are reflected in PPG, and (3) how ablation treatment for AF affects hemodynamics.

Background:
PPG is a technique that uses light to detect volumetric changes in the peripheral vasculature. It is widely available in wearables and provides a more continuous signal than electrocardiography (ECG). In research, PPG has been used to detect AF with high accuracy comparable to ECG. PPG is less well characterized than ECG, and it is unknown how ageing, hypertension, diabetes, and other risk factors as well as hemodynamics relating to AF are reflected in the PPG. This project will generate important basic knowledge on the clinical use of PPG and at the same time investigate the hemodynamics of AF, the most common arrhythmia worldwide.

Methods:
We will develop deep neural networks (DNN) for detecting hemodynamical patterns related to AF based on PPG recordings and characteristics from three independent cohorts comprising >6500 patients. Specifically, we will apply a DNN to (1) use PPG signals to distinguish patients with a risk factor (e.g. diabetes) from patients without, (2) investigate how hemodynamics change before, during and after AF in PPG signals, and (3) distinguish between a patient’s hemodynamical pattern before and after they have received ablation therapy with PPG signals. To allow for linkage between the PPG signal and the outcome, we will specifically develop and apply explainable AI (xAI) methods for PPG analysis. xAI allows for a visual interpretation of the otherwise hidden decision-making of the DNN and graphically depicts the linkage of the signal to the outcome. xAI has previously been used with ECG analysis and in this project, we will develop the method for use with PPG signals for characterisation of hemodynamics associated with the risk factors, paroxysmal AF, and AF management.

Perspectives:
This project will provide a novel understanding of PPG necessary for future clinical use and investigate unknown mechanisms of AF. Firstly, we will characterize the effect of prevalent risk factors on the PPG with huge implications for PPG algorithm development. We will also determine to what degree PPG may be used as a gatekeeper for further diagnostic work-up and reduce the number of unnecessary tests for the benefit of patients and society. Secondly, we will generate important knowledge on AF mechanisms and on how hemodynamics are reflected in PPG signals and our findings will be part of the scientific foundation necessary for the use of PPG in healthcare, whether driven by industry or academia. Finally, this project will help gain mechanistic information on ablation as a treatment for AF and might eventually help inform personalized treatment.

Amalie Koch Andersen

Position: Risk engine tool to support early prevention of diabetes-related complications for people with prediabetes

Categories: Cross-Academy Fellows 2024, Fellows

Location: Aalborg University

Abstract:

More than 650 million people suffer from prediabetes worldwide and the prevalence is increasing rapidly. A large part of these people will eventually develop microvascular and macrovascular complications generating a large economic burden on society. To prevent or delay onset of these complications, both lifestyle and pharmacological interventions are necessary. However, treatment tools or guidelines specifically for prevention of complications for this group do not exist in general practice. To address this challenge, the project seeks to improve the management of people with prediabetes by developing a decision support system to be implemented in general practice. Based on a prediction of the personalized risk of micro- or macrovascular complications and a risk stratification, individuals with high-risk profiles will be identified. Additionally, different scenarios with lifestyle and pharmacological interventions will be simulated. This novel prediabetes risk engine tool will support informed treatment and early prevention strategies in general practice, aiming to prevent or delay the onset of complications. Data from clinical studies and Danish national registers will be analyzed using data science techniques to identify patterns which are important for prediction of diabetes-related complications. Additionally, Artificial Intelligence including machine learning methodology will be used to develop the prediction model. No studies regarding prediabetes have investigated development and implementation of a flexible model allowing usage with only a limited amount of clinical data, and with a possibility of entering further data to increase precision of the risk estimate. Therefore, this project will focus on development of a flexible predictive model aimed at estimating the personalized risk of micro- or macrovascular complications among individuals with prediabetes.

Jasmin Hjerresen

Position: Genetic regulation of the plasma lipidome and its link to cardiometabolic disease in Greenlandic Inuit

Categories: Cross-Academy Fellows 2024, Fellows

Location: University of Copenhagen

Abstract:

Cardiometabolic diseases including type 2 diabetes (T2D), cardiovascular disease (CVD), and obesity pose a growing global health problem, and a decay in the public metabolic health in Greenland is associated with westernization of diet and lifestyle. The genetic architecture of the modern Greenlandic population is shaped by its demographic history, geographic isolation in an Arctic climate, and small population size, resulting in strong genetic drift and a high frequency of high-impact gene variants. Although genetic variants with high impact on metabolic health have already been described, the genetic regulation of the plasma lipidome and its link to cardiometabolic diseases is poorly understood.

Using state-of-the-art high-throughput mass spectrometry-based lipidomics, we aim to integrate plasma lipidomics data and genetic data from 2,539 Greenlandic individuals to better understand the link between lipid species and metabolic health. A study visit to the Swedish University of Agricultural Sciences for collaboration will provide this project with nuclear magnetic resonance lipidomics analysis, contributing quantitative and qualitative knowledge of the study population and the potential identification of novel compounds. With genome-wide association studies, mapping of lipid quantitative trait loci (lQTLs), and colocalization analyses, we will examine the cross-sectional associations between lipid profiles, registry-based data on cardiometabolic outcomes, and genetic data to identify prognostic biomarkers and investigate biological pathways related to T2D, CVD, and obesity. We hypothesize that we will see changes in the plasma lipidome in genetic loci linked to cardiometabolic disorders due to genetic drift of the Greenlandic population accompanied by westernized diet and lifestyle.

This project could offer novel insight into genetic etiology of cardiometabolic diseases to improve our understanding of molecular disease mechanisms and reveal novel targets for disease treatment and prevention in a broader perspective. The discovery of novel high-impact genetic variations associated with altered lipid profiles can contribute to understanding of metabolic health in Greenland and highlight the implications this research has for genetic precision medicine.

Manuel Mounir Demetry Thomasen

Position: Development of vocal biomarkers for predicting diabetes-related complications using deep learning

Categories: Cross-Academy Fellows 2024, Fellows

Location: Aarhus University

Abstract:

Background: Diabetes is a complex chronic condition with severe potential complications, which poses a huge burden on people with the condition, their families, and the healthcare sector. Risk assessment tools facilitating early detection of complications are crucial for prevention and progression management. Progression of diabetes and corresponding physiological changes affect several organs involved in the production of voice and speech. Vocal biomarkers are signatures, features, or a combination of features from the audio signal of the voice that are associated with a clinical outcome and can be used to monitor patients, diagnose a condition, or grade the severity of a disease. Vocal biomarkers for diseases affecting the nervous system are well-established, but there is also some evidence for a potential in diabetes and cardiovascular research. Therefore, this project focuses on cardiovascular disease (CVD), neuropathy, and diabetes distress as clinical outcomes. Previous studies have been rather small; therefore, there is also a need to establish new data collection with a focus on diabetes-related complications.

Aims: This interdisciplinary project aims to develop and integrate novel vocal biomarkers in risk assessment of diabetes-related complications. The work will involve (1) data collection, creating new resources for further research in an emerging field, and (2) development of machine learning methods and models that might reveal important clinical knowledge about diabetes-related complications: cardiovascular disease, neuropathy and diabetes distress.

Methods: First, machine learning models will be pre-trained on large datasets (audio and image) for various audio prediction tasks. These models will then be fine-tuned for the clinical prediction tasks with a method called transfer learning. These models will predict the presence of CVD, neuropathy, and the level of diabetes distress. For prediction of diabetes distress, voice data will be combined with features extracted from answers to open-ended questions with large language models. Model performance will be evaluated in internal test sets and validated in a global data source (Colive Voice) during a research stay at the Luxembourg Institute of Health.

Perspectives: The proposed project will contribute valuable insight into how voice data can be used in risk assessment of diabetes-related complications. The project is expected to generate both methodological results (e.g. pre-trained models, new data sources for machine learning research) and clinically relevant tools (e.g. vocal biomarkers) that might contribute to innovative ways of monitoring diabetes-related complications in the future.

PhD Fellows 2023

Asbjørn Munk

Position: Toward Theoretically Grounded Domain Adaptation for Brain Image Segmentation

Categories: Fellows, PhD Fellows 2023

Location: University of Copenhagen

Abstract:

This research proposal is a collaboration between the Machine Learning Group at University of California, Berkeley, Bruce Fischl’s group at Harvard Medical School, and the Medical Image Analysis Group at University of Copenhagen.

Deep learning has shown tremendous success in brain image analysis, aiding clinicians and researchers in detecting and treating a wide range of diseases from various data sources such as CT and MR images. A common task is to perform segmentation on images. However, the field is fundamentally limited by a lack of labeled training data and substantial variations in the data available. As a result of this, models often exhibit a lack of robustness to changes in equipment, patient cohorts, hospitals, and scanning protocols.

Because segmentation models have very large hypothesis spaces, existing domain adaptation theory and methodologies have failed to alleviate these problems. Current methodologies are either not grounded in theory or impractical to apply to segmentation models, where the large hypothesis spaces make it intrinsically difficult to overcome numerical issues or achieve noteworthy performance.

To push forward the field of brain image analysis, there is a need for theoretically well-founded domain adaptation methods. This project aims to work towards such methods by boldly conducting theoretical work with the world-leading machine learning group at Berkeley and applying this work to brain image segmentation in collaboration with Bruce Fischl’s world-leading group at Harvard Medical School. The project is to be centered at the Medical Image Analysis group at University of Copenhagen, which is internationally recognized for applying cutting-edge machine learning practices to medical image analysis problems.

If successful, this project will lead to more robust models, a fundamental contribution towards utilizing the vast number of medical images being produced at hospitals worldwide. This work will contribute towards providing technology for improving fundamental brain research as well as.

Mikkel Werling

Position: Increasing Predictive Performance and Generalizability of AI in Healthcare Using Meta-learning and Federated Learning in an International Collaboration

Categories: Fellows, PhD Fellows 2023

Location: University of Copenhagen

Abstract:

In recent years, artificial intelligence has shown remarkable results in computer vision, natural language processing, and image generation. But in many domains within health, progress in predictive models has stagnated. Algorithms often show (1) low prediction accuracies and (2) poor generalizability beyond training data. Low prediction accuracies are largely the result of ubiquitous low-resource settings in health and an inability to incorporate data from different sources (e.g., different countries and different data modalities). The problem of generalizability is mainly due to algorithms being trained on data from a single site but rarely benchmarked on external data, leading to overfitting and vulnerability to data shifts.

In this project, we address the problem of low prediction accuracies and generalizability in the specific domain of chronic lymphocytic leukemia (CLL), where progress in prognostic models has stagnated.

We increase prediction accuracies by developing a novel meta-learning framework capable of handling multiple data modalities and multiple outcomes. This allows us to include multiple data sources as well as combine information from related diseases (multiple myeloma and lymphoma primarily) (Figure 1A), drastically reducing the number of samples needed for state-of-the-art performance.

We address the problem of generalizability by spearheading an international collaboration across four different countries. By combining federated learning with a model capable of domain adaptation, we overcome the issue of heterogeneity in the data from different countries thereby producing internationally robust results (Figure 2B). We establish a global benchmark, allowing us to assess the international generalizability of our model.
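
The aggregation step at the heart of federated learning can be sketched as follows. The abstract does not name a specific algorithm, so this example assumes plain federated averaging, in which each site trains locally and shares only its model parameters and sample count; the function name and the toy numbers are hypothetical, and the domain adaptation component described above is not shown.

```python
def federated_average(site_updates):
    """Average model parameters across sites, weighted by each site's sample count.

    site_updates: list of (n_samples, weights) pairs, where weights is a list of floats.
    """
    total = sum(n for n, _ in site_updates)
    dim = len(site_updates[0][1])
    return [
        sum(n * weights[i] for n, weights in site_updates) / total
        for i in range(dim)
    ]


# Two hypothetical sites that trained the same two-parameter model locally.
site_updates = [
    (100, [1.0, 2.0]),  # smaller cohort
    (300, [3.0, 6.0]),  # larger cohort, so it carries more weight
]
global_weights = federated_average(site_updates)
```

Because only parameters leave each site, patient-level data can stay within its country of origin, which is what makes this kind of international collaboration feasible.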

By providing a proof-of-concept of the value of learning from multiple diseases, we revolutionize how we think about patient data in health. Using CLL as a litmus test, this project will generate a roadmap for overcoming some of the biggest barriers in health machine learning (hML) and achieving state-of-the-art performance even in low-resource domains.

Thomas Gade Koefoed

Position: Resolving Insulin Resistance Heterogeneity in Human Skeletal Muscle Using Multimodal Single-nucleus Sequencing

Categories: Fellows, PhD Fellows 2023

Abstract:

Insulin resistance (IR) is a key characteristic of type 2 diabetes (T2D) – a common and severe condition characterized by dysregulated blood glucose levels. Despite considerable efforts to map the complex characteristics of IR and T2D, detailed characterizations of IR in some important metabolic tissues, such as skeletal muscle, are still lacking.

In this project, we propose to use a high-throughput, state-of-the-art single-nucleus sequencing assay to gain cutting-edge biological insight into the transcriptomic, epigenetic, and cellular characteristics of IR in skeletal muscle. Furthermore, we will use the generated data to investigate the pivotal role of this tissue in the development of IR and T2D. Specifically, we will determine which muscle cell types mediate the most heritable risk of IR and T2D, potentially elucidating novel targets for treatment. Finally, we will investigate whether cell-type-specific polygenic risk scores can enable better prediction of a patient’s disease comorbidities and drug responses when compared to the use of traditional, non-cell-type-specific polygenic risk scores. No such analysis has yet been performed for human skeletal muscle, and the resulting stratification of heterogeneous IR and T2D patient groups would constitute an important advancement in precision medicine.

The single-nucleus assay will be performed by the Hansen group for Genomic Physiology and Translation in collaboration with the Single-Cell Omics Platform at the Novo Nordisk Foundation Center for Basic Metabolic Research. The full dataset will be generated before the start of the project in Q3 2023, at which point the PhD candidate will start computationally analyzing the data, drawing upon state-of-the-art bioinformatic tools and machine learning models. Importantly, the proposal is based on proof-of-concept data from one skeletal muscle sample, which is included in the project description. Additionally, the project is based on multiple national and international interdisciplinary collaborations, including supervisors from both clinical and technical backgrounds and a six-month research stay at the Broad Institute of Harvard and MIT, Boston, USA.

Finally, it should be noted that the bioinformatic analyses in this project can be generalized to any heritable disease and tissue. We, therefore, believe that the knowledge and methodological advancements gained from the project will have a wider clinical impact beyond skeletal muscle and metabolic diseases.

Arman Simonyan

Position: New Machine Learning-Driven Approaches for Peptide Hormone and Drug Discovery

Categories: Fellows, PhD Fellows 2023

Location: University of Copenhagen

Abstract:

Two-thirds of human hormones act through ~800 G protein-coupled receptors (GPCRs). The vast majority (71%) of these hormones are peptides or proteins, which also account for an increasing share of drugs. The study of peptide-receptor recognition is thus essential for understanding physiology, pathology and for drug design.

This project aims to solve the modeling problem of peptide-receptor recognition by leveraging machine learning methods and unique data from the field hub GPCRdb. I will build predictive graph neural network models representing residue interaction networks across the receptor-peptide interface. The models will utilize attention-based transformer and LSTM architectures, which have shown great promise in drug-target interaction prediction and de novo drug design. The models will be trained on a unique data representation, storing data for individual residues rather than the overall protein. This will allow peptide data to be inferred across conserved residues in different receptors, enabling use on receptors not targetable with classical methods.

The trained models will be used in three applied aims to: (1) discover peptide hormones by matching the library of predicted physiological peptide hormones to their cognate receptors with unknown physiological ligand and function; (2) identify peptide probes by matching a pentameric peptide library to understudied and drug target receptors; and (3) holistically engineer probes for those receptors residue-by-residue. The in silico discovered probes will be tested in vitro by pharmacological collaborators. In all, this will let me discover novel hormones and engineer new probes, enabling functional characterization of understudied receptors that cannot be targeted with current techniques.

This project has the potential to uncover mechanisms of peptide-receptor recognition underlying physiological, sensory, and therapeutic responses. This will lay the foundation for exploring uncharted receptor functions and designing better drugs. Given this and that our approach will be applicable

Mikkel Runason Simonsen

Position: Improving the Clinical Utility of Prognostic Tools through Calibration

Categories: Fellows, PhD Fellows 2023

Location: Aalborg University

Abstract:

For a wide range of medical conditions, prognostic models are used routinely to inform patients about their outlooks, guide treatment choice, and recruit patients into clinical trials. However, many prognostic models are developed and used with knowledge only of the model's discriminatory capacity, and not of its calibration and clinical utility. This PhD program aims to develop a method that can improve the calibration, and thus the clinical utility, of prognostic models, such that they will apply in heterogeneous clinical settings across borders and continents. Additionally, new prognostic models for specific hematological cancers that outperform existing models will be developed.

The project consists of two elements. Firstly, we will develop a new methodology to improve external validation of prognostic models particularly aiming at improving model calibration. This is of particular interest as new prognostic models developed in a Danish setting may not perform as well in other countries with different clinical standards, background mortality, and culture. Secondly, we will develop new prognostic models within hematological cancers using the newly developed methodology in combination with machine learning and artificial intelligence (AI) approaches. Denmark holds numerous comprehensive clinical registers, which the model development will be based on.

Development of a methodology for improving performance, particularly model calibration, of prognostic models will allow for the development of prognostic models that perform well in a variety of economic, cultural, and clinical settings. Improving the precision of prognostic models will provide health care planners, patients, and clinicians with a better foundation for making important clinical decisions. For instance, accurate prognostic models for hematological cancers can identify high-risk patients more accurately at the time of diagnosis, which can be used to guide treatment or recruit patients for clinical trials. Identification of low-risk patients is also important as these will be candidates for de-escalating treatment, which can avoid severe side effects from the treatment.

Javier Garcia Ciudad

Position: Modelling electrophysiological features of sleep and the variation induced by differences in species, gender, age, or disease using deep learning

Categories: Fellows, PhD Fellows 2023

Location: University of Copenhagen

Abstract:

The purpose of this project is to expand our knowledge about the electrophysiological features of sleep, with a particular focus on establishing links and differences between human and mouse sleep in both healthy and narcoleptic phenotypes. Narcolepsy is a sleep disorder characterized by excessive daytime sleepiness. Mouse models are often used to study narcolepsy by introducing specific pathological changes with gene manipulation techniques. Both in humans and mice, sleep and narcolepsy are often studied using electrophysiological signals. Still today, these signals are mainly analyzed by manual annotation of different sleep stages. In recent years, deep learning scoring models have been introduced, though without becoming widely implemented.

These models apply just to humans or just to mice, which is partly motivated by a lack of understanding of how much human and mouse sleep have in common. Finding similarities between both would support the development of common scoring models. More importantly, it would allow causal links to be made between the specific pathological changes modeled in mice and the human disease, which is one of the major challenges in narcolepsy research. In addition, finding electrophysiological signatures of narcolepsy or other factors such as age or gender would enhance our understanding of narcolepsy and sleep.

For this purpose, sleep signals will be studied using state-of-the-art deep learning methods. Sleep scoring models based on transformers and convolutional and recurrent neural networks will be studied to investigate how well they translate between the human and mouse domains. In addition, representation learning using variational autoencoders and contrastive learning techniques will be employed to learn compact representations of sleep signals, with the goal of providing species-invariant representations and identifying individual variabilities from the signals. The learned representations will be projected to lower-dimensional latent spaces, in which the distance between groups can be evaluated. Finally, explainable AI techniques will be investigated to extract insights from the models used, which could reveal EEG biomarkers of species, disease state and other individual variabilities.

Sebastian Loeschke

Position: Tensor Networks for Efficient Image Rendering

Categories: Fellows, PhD Fellows 2023

Location: IT University of Copenhagen

Abstract:

Efficient and realistic image rendering (IR) has long been a focus of research. Machine learning (ML) techniques for IR have enabled the creation of complex and photorealistic images. Despite recent advances, these techniques are often slow and memory-intensive, limiting their practical use.

This Ph.D. proposal aims to explore the potential of quantum-inspired tensor network (TN) methods for IR tasks, with the goal of reducing memory and computational costs. TNs are versatile and powerful scientific simulation tools that have been successful in simulating strongly correlated quantum many-body systems and quantum circuits. TNs have also been used to compress deep neural networks, leading to significant memory savings in ML applications. However, TNs have not been utilized as extensively as neural networks in ML, and the development of tools and techniques for training them has been limited.

This project will develop novel algorithms and techniques that leverage TNs’ full capabilities in an ML and IR setting to achieve real-time or animated 3D IR at high precision. The project will identify promising TN embeddings for images and scenes, and develop efficient learning algorithms for constructing them. Specific projects include exploring discrete vs. continuous TN embeddings, upsampling methods, and incorporating TNs into normalizing flows and diffusion models to improve representational power and inference time.

This project has the potential to significantly contribute to the fields of ML, IR, quantum computation, and life sciences, which heavily rely on the analysis of large datasets. By developing efficient IR techniques, this project aims to make IR more practical and accessible, benefiting fields such as medical imaging, gaming, virtual and augmented reality, and robotics. Additionally, TN methods have the potential to significantly reduce the carbon footprint of ML applications by developing more efficient algorithms that can process large datasets with fewer computational resources. This will not only benefit the environment but also democratize ML by making it more accessible to a wider range of individuals. In addition, the use of TNs allows for better explainability compared to deep learning models. Lastly, this project will contribute to the collaboration between the quantum and ML communities and also help map out the landscape where TN algorithms provide an advantage, paving the way for future advancements in quantum-native algorithms.

Jakob Lønborg Christensen

Position: Diffusion Models for Image Segmentation

Categories: Fellows, PhD Fellows 2023

Location: Technical University of Denmark

Abstract:

Image segmentation is an important research area that has, due to deep learning, seen great advances in recent years. There are still problems to solve, especially when annotated data is scarce. We propose a PhD project aiming to unify agnostic segmentation models with the diffusion process. We argue this is a good idea since many of the ideas in diffusion can be applied to segmentation.

Recent diffusion model developments have been focused largely on the text-to-image domain. Adapting these methods to segmentation can give rise to useful models with human-in-the-loop or few-shot capabilities. The PhD has the potential to be valuable for collaborators of the Visual Computing section at DTU, while also having the potential for larger impacts in the research area as a whole. The applicant, Jakob Lønborg Christensen, is an honours programme student at DTU with multiple peer-reviewed publications. This PhD project would benefit significantly from not being bound to a specific application area or a specific dataset.

Christoffer Sejling

Position: New Methods for Functional Data to quantify Clinically Relevant Traits of CGM (Continuous Glucose Monitor) Measurement Patterns and guide Clinical Decision Making

Categories: Fellows, PhD Fellows 2023

Location: University of Copenhagen

Abstract:

Diabetes and prediabetes are increasingly prevalent conditions in modern society, both of which are associated with numerous hazardous health conditions such as obesity, hypertension, and cardiovascular disease. In itself, type 1 diabetes (T1D) is a life-changing diagnosis, forcing a need for constant health awareness. When dealing with these challenges, a continuous glucose monitor (CGM) is a vital tool that helps patients evaluate their own health and helps inform clinical decision making in a cost-effective manner. Use of CGM devices is therefore becoming more and more common in diabetes clinics around the world, where data from CGMs are collected and analyzed with the objective of optimizing patient care. The increasing adoption of CGMs brings about a huge potential for improving care by developing a data-driven methodology that can be used to assess the CGM data. However, since only simplistic methods based on different summary statistics have been attempted in clinical practice, we still need to uncover the full potential of the information in CGM measurements.

In this project, we aim at further developing the statistical methodology for drawing out information from CGM trajectories by making use of complex features such as slope, locality, and temporality. In particular we seek to carry out prediction and statistical inference for clinically relevant outcomes on that basis. Additionally, we aim at estimating causal effects, which may help guide clinical decision making. As outcomes, we consider the occurrence of entering and leaving a state of remission as well as the occurrence of entering a state of hypoglycemia for T1D patients at Steno Diabetes Center Copenhagen. We specifically seek to enhance performance in the prediction of these clinical occurrences and the identification of clinically meaningful attributes by taking advantage of the longitudinal calendar order of the observed CGM trajectories for each patient.

In summary, we aim at obtaining a characterization of CGM trajectory shapes that provides accessible, usable, and valid information, on which clinicians may base their assessments and decisions.

Jette Steinbach

Position: The Impact of Genetic and Environmental Heterogeneity on Health Prediction Models

Categories: Fellows, PhD Fellows 2023

Location: Aarhus University

Abstract:

The ability to predict disease risk and identify individuals at high risk for developing a certain disease is fundamental to modern healthcare, since it enables the implementation of preventive measures and personalized treatments. Polygenic scores (PGS) have received attention for their promise to improve clinical prediction models. Recently, electronic health records (EHR) have also proven to enhance prediction accuracy. However, the accuracy of both PGS and EHR in clinical prediction models is impacted by individual genetic, environmental and diagnostic heterogeneity, which can lead to racial, gender, and ancestry-based biases. It is important to understand and measure the impact and severity of these types of heterogeneities, in order to develop more inclusive, accurate and robust prediction models. These models need to be evaluated and replicated across cohorts and in individuals of different genetic ancestries.

The proposed PhD project intends to address this by evaluating the impact of these heterogeneities on the predictive performance of PGS, EHR and informed family history (FH) within and across cohorts and ancestries. It will do so by studying the effect of genetic and environmental heterogeneity on the prediction accuracy for numerous health outcomes, characterizing differences in EHR across populations, and providing more robust prediction models that incorporate EHR, PGS and FH.

This PhD project aims to contribute high-quality research to the fields of psychiatric epidemiology and psychiatric genetics by providing insight into the predictive accuracy of prediction models across ancestries and cohorts. It intends to provide deeper knowledge about the impact of genetic and environmental heterogeneity on the predictive performance of PGS, informed FH and EHR, and may serve as a guide for future research on the development of clinical prediction models.

Postdoc Fellows 2023

Luigi Gresele

Position: Causal Representation Learning: Conceptual Foundations, Identifiability and Scientific Applications

Categories: Fellows, Postdoc Fellows 2023

Location: University of Copenhagen

Abstract:

Representation learning and causality are fundamental research areas in machine learning and artificial intelligence. Identifiability is a critical concept in both fields: it determines whether underlying factors of variation can be uniquely reconstructed from data in representation learning, and specifies the conditions for answering causal queries unambiguously in causal inference. Causal Representation Learning (CRL) combines these two fields to seek latent representations that support causal reasoning. Recent theoretical advances in CRL have focused on the identifiability of a ground truth causal model. In this research proposal, I present two projects aimed at investigating previously unexplored aspects of CRL.

The first project aims to challenge the assumption of a unique ground truth causal model, by acknowledging that the same causal system can be described using different variables or levels of abstraction. To address this, we plan to investigate novel notions of identifiability, where the true model is reconstructed up to classes of causal abstractions consistently describing the system at different resolutions. We will also search for conditions under which these models can be learned based on available measurements. By doing so, we aim to clarify the conceptual foundations of CRL and inspire the development of new algorithms.

The second project aims to investigate latent causal modelling in targeted experiments, exploiting the rich experimental data and partial knowledge available in scientific domains to refine the CRL problem. Specifically, we will focus on neuroimaging experiments with treatment and control groups, with the objective of isolating the impact of covariates specific to the treatment group on functional brain data, disentangling it from responses elicited by the experimental protocol, which are shared across both groups. An additional difficulty stems from the variability in the coordinatizations of brain functional activities across different subjects due to anatomical differences. We plan to extend our previous work on group studies in neuroimaging to address these challenges. The outcome of this project could have a significant impact on scientific applications of machine learning, also beyond cognitive neuroscience.

In summary, my proposed research projects have the potential to advance the state-of-the-art in Causal Representation Learning, clarifying its conceptual foundations and enabling its application to real-world problems.

Dustin Wright

Position: Supporting Faithful Reporting on Scientific Research with AI Writing Assistant

Categories: Fellows, Postdoc Fellows 2023

Location: University of Copenhagen

Abstract:

Science reporting is not an easy task due to the discrepancy between scientific jargon and lay terms, as well as a discrepancy between the language of scientific papers and associated news articles. As such, not all scientific communication accurately conveys the original information, which is exemplified by skewed reporting of less technical topics and unfaithful reporting of scientific findings. To compound this problem, the average amount of time journalists can spend on individual articles has decreased due to funding cuts, lack of space, and increased commercialization. At the same time, the public relies on the media to learn about new scientific findings, and media portrayal of science affects people’s trust in science while at the same time influencing their future actions [7,26,27].

My project proposes to develop natural language processing (NLP) tools to support journalists in faithfully reporting on scientific findings, namely, tools for extracting key findings from scientific articles, translating scientific jargon into lay language, and generating summaries of scientific articles in multiple languages while avoiding distortions of scientific findings.

In two recent studies which I led [20,21], we investigated automatically detecting exaggeration in health science press releases as well as general information change between science reporting and scientific papers, and found that large pre-trained language models can be successfully exploited for these tasks. This project will leverage my previous research and will be much more ambitious, focusing on: 1) detecting distortions between news articles and scientific articles in different languages and across multiple areas of science; 2) using a model which can detect such distortions to automatically generate more faithful news articles; 3) analyzing texts in the difficult domains of medicine, biology, psychology, and computer science research, which I have worked with previously and which garner some of the most media attention. This will result in trained models which can be used as writing assistants for journalists, helping to improve the quality of scientific reporting and information available to the public. In addition, the project will involve international collaboration with the University of Michigan, including a research stay in order to leverage their expertise and resources, as well as develop my competencies as a researcher.

Beatriz Quintanilla Casas

Position: Exploratory Gastronomy (EXPLOGA): Turning Flavour Chemistry Into Gastronomy Through Advanced Data Analysis

Categories: Fellows, Postdoc Fellows 2023

Location: University of Copenhagen

Abstract:

Today’s design and production of food products are still based on human artisan skills, especially when it comes to high-quality products where blending of raw materials is key. The development of new data science tools plays a key role in this food transition, as such tools make it possible to comprehensively exploit current knowledge while uncovering new connections. Therefore, the proposed project, named EXPLOGA – Exploratory Gastronomy, seeks to improve food design and production practices, making them more efficient and sustainable, by developing new scientific data tools.

These new tools will be able to convert food flavour measurements into chemically and gastronomically well-defined information, through automated untargeted profiling of flavour data as well as advanced text analysis of the existing flavour information. EXPLOGA represents the first level of a new field of research that we call the functional gastronomy approach, which aims to use data science to better understand the influence of raw materials and processing techniques on the final food products in a broad sense.

This project will be carried out at the Chemometrics group at the Department of Food Science (University of Copenhagen), supervised by Prof. Rasmus Bro. It will also include a three-month international stay at the Norwegian Food Research Institute (NOFIMA).

Ignacio Peis Aznarte

Position: Implicit Neural Representations Generation for Efficiently Handling Incomplete Data

Categories: Fellows, Postdoc Fellows 2023

Location: Technical University of Denmark

Abstract:

Inference-friendly deep generative models such as Variational Autoencoders have shown great success in modelling incomplete data. These models typically infer posteriors from the observed features and decode the latent variables to impute the missing features. Recent deep generative models are well suited for modelling structured data like images, sequences, or vector-valued numbers, and they use neural architectures specifically tailored to the data type. Unfortunately, using these networks for grid-type data necessitates pre-imputation methods, such as zero-filling missing patches, leading to biased inference.

In contrast, Implicit Neural Representations (INRs) model complex functions that map coordinates to features in a point-wise setting using feedforward neural networks, independently of the data type and structure. As a consequence, they infer knowledge only from observed points, thus overcoming the aforementioned bias. Although Markov Chain Monte Carlo (MCMC) methods have been widely used to improve inference in classical deep generative models of structured data, their effectiveness in models of INRs is still an open research question.

My proposed project aims to revolutionize deep generative modelling by leveraging the power of Implicit Neural Representations (INRs) to model incomplete data without introducing any bias. By i) creating novel deep generative models of INRs, and ii) proposing novel MCMC-based inference methods for these models, we can overcome the limitations of existing techniques and open new directions for using MCMC-based inference in generative models of INRs. These groundbreaking contributions have the potential to transform the field of deep generative modelling and have significant implications for how they handle missing data.

Daniel Murnane

Position: Learning the Language of Reality: A Multi-tasking, Multi-scale Physics Language Model for High Energy Physics

Categories: Fellows, Postdoc Fellows 2023

Location: Niels Bohr Institute, University of Copenhagen

Abstract:

The search for new physics beyond the Standard Model at the Large Hadron Collider (LHC) at CERN has been an elusive quest, despite the billion-euro machinery and extremely sensitive detectors used in the experiment. To overcome this obstacle, I propose a project to develop a novel machine learning (ML) approach called a Physics Language Model (PLM).

The PLM is a graph neural network (GNN) that maintains multiple scales of information about the energy deposits across the ATLAS detector located at the LHC. Instead of discarding fine details as is currently done, the PLM uses a hierarchical structure to pay attention to the most relevant scales and features of the physics data. This approach can also be trained on a variety of physics tasks and, in other domains such as protein property prediction, has been shown to outperform single-task models. Novel developments in the field of high energy physics (HEP) should be expected to feed back into improved Biological and Chemical Language Models.

The current HEP paradigm is to work on a discrete task in the physics analysis chain, using only the scale and granularity of the data produced in the previous stage. Modern ML models, and large language models (LLMs) such as GPT in particular, are a complete inversion of this paradigm. They instead gain expressivity from learning emergent patterns in the fine details of many datasets and tasks. In my role as Machine Learning Forum Convener for ATLAS, and with current collaborations with Berkeley Lab, DeepMind, Columbia University, Copenhagen University and Georgia Tech on this topic, I believe the time has come to use the available data, physics tasks, and huge compute availability to build a prototype PLM.

The PLM could greatly increase the discovery power for new physics at the LHC by reviving the data that is currently discarded. This is a unique opportunity, as algorithm choices for the High Luminosity LHC (HL-LHC) upgrade will be finalized within 18 months. If trends in natural language ML can be captured in physics, a PLM can also be expected to grow exponentially in power with increasing dataset and model size.

Laura Helene Rasmussen

Position: When Winter is Weird: Quantifying the Change in Winters Across the Arctic

Categories: Fellows, Postdoc Fellows 2023

Location: University of Copenhagen

Abstract:

Arctic winter climate is rapidly changing, with more variable snow depths and spring snowmelt timing, and more frequent midwinter thaw events. Less predictable conditions disrupt ecosystem balances and development in Arctic communities, and understanding winter variability across the Arctic and its influence on climate throughout the year is needed to mitigate the consequences of changing winters. However, access to in situ measured data has been extremely limited, and the data are scattered across local databases. Hence, cross-Arctic winter studies are few and based on remotely sensed data with larger spatial and temporal coverage but less local sensitivity, and the winter contribution to annual average temperature change has not been investigated across the Arctic.

In this project, we 1) obtain, clean and standardize in situ soil surface temperature, snow depth and soil moisture data from climate monitoring programs across the Arctic and create a unique database with cross-Arctic in situ winter climate data from approximately the last 30 years. We use this dataset to 2a) estimate the accuracy of remotely sensed soil surface temperature, snow depth and soil moisture data using the regression model with the best fit, and quantify the bias, for each major Arctic region. We further 2b) construct an open access Winter Variability Index (WVI) for each major Arctic region based on the winter phenomena (average snow depth, snowmelt date, frequency of winter thaw events) that are the most important drivers of a clustering analysis such as PCA, hierarchical clustering or autoencoders. Finally, we 3) use the change in WVI and in annual mean temperatures for each decade in a function-on-function regression analysis, which will quantify the contribution of winter variability change to annual average temperature changes in each Arctic region.

The project will produce a comprehensive dataset with potential for further research and will improve our region-specific understanding of remotely sensed data accuracy, which is key for confidence in climate system modelling. The WVI allows scientists or local communities to classify Arctic winter data within a quantitative framework of pan-Arctic winter variability, also in the future, and to understand how important changes in winter variability are for Arctic climate changes throughout the year.

PhD Fellows 2022

Fabian Martin Mager

Position: A Self-supervised Model of the Brain to Help Us Understand Mental Disorders

Categories: Fellows, PhD Fellows 2022

Location: Technical University of Denmark

Abstract:

Our brain is the most central part of the nervous system and crucial to our health and wellbeing. In Denmark, mental disorders make up 25% of the total disease burden, and their prevalence is still increasing. Across many medical domains, magnetic resonance imaging (MRI) is a widely used tool to study the anatomy and physiology of soft tissue, e.g. the brain, and is applied in diagnosis and monitoring of diseases, as well as a tool to investigate their underlying mechanisms. In psychiatry, research has yielded substantial evidence for structural brain changes at a group level; however, these are typically subtle, and currently there is no clinical benefit from MRI for the individual patient. Previous research aiming to identify brain aberrations in patients with neuropsychiatric disorders has struggled with relatively small and often inhomogeneous samples paired with complex clinical traits and weak pathological signals.

To unravel the intricacy of mental disorders and the brain, one approach is to apply powerful state-of-the-art machine learning algorithms, such as deep neural networks (DNNs). Large DNNs are able to extract high-level features of images and other signals and to solve sophisticated tasks, outperforming traditional machine learning methods by far. On the downside, a DNN requires a large amount of ‘labelled data’, e.g., where each brain image has a meaningful annotation, such as ‘patient’ or ‘control’. The amount of labelled data needed to train such a model is currently not available in conventional psychiatric research. In contrast, ‘unlabelled data’, e.g. normative brain images independent of a certain class or group, are often generously and publicly available. In the field of machine learning, the scarcity of labelled data and the richness of unlabelled data have given rise to self-supervised learning paradigms. In self-supervised learning, one exploits rich unlabelled data to learn a general intermediate representation of the matter of interest. Scarce labelled data is then used efficiently to fine-tune the intermediate representation to a specific task of interest.

The aim of this project is to develop a self-supervised DNN model of the brain using MRI data from large, international, high-quality databases. This model will then be fine-tuned using highly specific data from psychiatry to address specific research questions regarding the mechanisms of aberrant neurodevelopment. We believe a self-supervised model is more robust and able to learn more meaningful features than conventional models. To explore these features and their relation to clinical traits present in psychiatric patients, we want to employ explainable artificial intelligence techniques.

To sum up, we want to use self-supervised learning paradigms, and their efficient use of scarce labelled data, to develop a state-of-the-art DNN model of brain images, bringing neuropsychiatric research to the forefront of machine learning research.

Note: Since the video was recorded, Fabian has adjusted the scope and title of his research project. The title in the written article is current and correct.

Emilie Wedenborg

Position: Discovering Polytopes in High Dimensional, Heterogeneous Datasets using Bayesian Archetypal Analysis

Categories: Fellows, PhD Fellows 2022

Location: Technical University of Denmark

Abstract:

Real World Data (RWD), such as electronic medical records, national health registries, and insurance claims data, provide vast amounts of high-granularity, heterogeneous data. An international standard (OMOP) has been developed to standardise health data and accelerate evidence generation from RWD. The EU has recently adopted the same standard for the European Health Data & Evidence Network (EHDEN), the largest federated health data network, covering more than 500 million patient records. This allows standardization of datasets across institutions in 26 different countries, but a major data science challenge remains: how to tackle the volume and complexity of multimodal data of such magnitude.

The aim is to develop easily human-interpretable tools to analyse RWD and extract distinct characteristics, enabling new discoveries. The project includes a key industrial collaborator, H. Lundbeck A/S, that will provide additional guidance, contacts, and access to large sets of RWD in the OMOP format.

The project will focus on a prominent data science methodology called Archetypal Analysis, which identifies distinct characteristics, called archetypes, and describes each observation in terms of these archetypes, thereby defining polytopes in high-dimensional data. This project will develop tools for uncovering such polytopes in large, high-dimensional, heterogeneous, noisy, and incomplete data. We will develop Bayesian modeling approaches for uncertainty and complexity characterization, data fusion for enhanced inference, and deep learning methods to uncover disentangled polytopes.
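
As a rough illustration of the idea, the sketch below shows how observations can be written as convex combinations of archetypes, so that every observation lies inside the polytope spanned by them. The archetype coordinates, the weights, and the helper name `reconstruct` are made up for this example; it is not the project's Bayesian method and does not fit archetypes from data.

```python
import numpy as np

# Three hypothetical archetypes in a 2-D feature space (corners of a triangle).
archetypes = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
])

# Hypothetical soft assignments: one row per observation, one column per
# archetype. Each row is non-negative and sums to 1.
weights = np.array([
    [1.0, 0.0, 0.0],        # a "pure" instance of the first archetype
    [0.2, 0.5, 0.3],        # a mixture of all three
    [1 / 3, 1 / 3, 1 / 3],  # the centre of the polytope
])


def reconstruct(weights, archetypes):
    """Return observations as convex combinations of the archetypes."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.allclose(weights.sum(axis=1), 1.0):
        raise ValueError("each row of weights must lie on the probability simplex")
    return weights @ np.asarray(archetypes, dtype=float)


points = reconstruct(weights, archetypes)
print(points)
```

Each row of `weights` can be read as a soft-assigned profile of one observation across the archetypes.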

The tool will advance our understanding of RWD and will accelerate real world evidence generation through the identification of patterns in terms of archetypes. Furthermore, trade-offs within archetypes can fuel personalized medicine by defining a profile of the individual patient as a softly assigned spectrum between archetypes. We hypothesize that this characterization will be important for advancing our understanding of subtypes and comorbidities within different neurological and psychiatric disorders.

Amalie Pauli Brogaard

Position: Computational Rhetorical Analysis: Transfer Learning Methods for Detecting Persuasion in Text

Categories: Fellows, PhD Fellows 2022

Location: Aarhus University

Abstract:

Misinformation and propaganda are recognised as a major threat to people’s judgement and informed decision making in health, politics and news consumption. The spread of misinformation relating to the Covid-19 epidemic is just one prominent example. It is not only wrong facts that constitute a threat, but also the language used, which can deceive and mislead people. To address the misinformation threat and empower readers confronted with enormous amounts of information, we propose a new data science methodology for the computational analysis of rhetoric in text.

While rhetoric, the art of persuasion, is an ancient discipline, its computational analysis, regarding persuasion techniques, is still in its infancy. We propose a data science project on computational modelling and automatic detection of persuasion techniques at the intersection of Natural Language Processing (NLP) and Machine Learning. We posit that detecting and highlighting persuasion techniques enables critical reading of a text, thereby reducing the impact of manipulative and disingenuous content.

Knowing and understanding fallacies and rhetorical tricks may also help to make stronger, valid arguments in a variety of texts. Moreover, we expect rhetorical information to be beneficial to other semantic language processing tasks and we, therefore, devise approaches to capture and transfer rhetorical knowledge to models for such tasks.

This project will contribute novel models for detecting persuasion techniques, as well as new transfer learning methods for utilising rhetorical knowledge to benefit other semantic text analysis methods.

Gala Humblot-Renaux

Position: “Are you sure?” Towards Trustworthy Computer Vision with Deep Multimodal Uncertainty Estimation

Categories: Fellows, PhD Fellows 2022

Location: Aalborg University

Abstract:

Human perception is inherently uncertainty-aware (we naturally adapt our decisions based on how confident we are in our own understanding) and multimodal (we seldom rely on a single source of information). Analogously, we argue that trustworthy computer vision systems should (at the very least) (1) express an appropriate level of uncertainty, such that we can reliably identify and understand their mistakes and (2) leverage multiple complementary sources of information, in order to be sufficiently well-informed. While state-of-the-art deep neural networks (DNNs) hold great potential across a wide range of image understanding problems, they offer little to no performance guarantees at run-time when fed data which deviates from their training distribution.

Reliably quantifying their predictive uncertainty in complex multimodal computer vision tasks remains an open research problem, yet will be a necessity for widespread adoption in safety-critical applications. The aim of this project is therefore to conduct basic research in probabilistic deep learning and computer vision, investigating how uncertainty can be modelled and extracted in multimodal DNNs for image classification and segmentation.

We will adopt approximate Bayesian inference methods to separately capture data uncertainty and model uncertainty not only at the final prediction stage, but also in the intermediate feature fusion process, in order to adaptively weigh the contribution of each modality. We will develop novel uncertainty-aware deep fusion methods, and study them on real-world computer vision tasks across a broad range of high-stakes domains including multimodal medical image analysis. Our research will be an important step towards improving the transparency and robustness of modern neural networks and fulfilling their potential as safe, trustworthy decision-support tools.
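
One common way to separate the two kinds of uncertainty for a classifier, given several stochastic predictions for the same input (for example from Monte Carlo dropout or an ensemble), is an entropy-based decomposition. The sketch below is only illustrative: the function name `decompose_uncertainty` and the example probabilities are assumptions for this example, and it is not necessarily the method the project will use.

```python
import numpy as np


def decompose_uncertainty(probs, eps=1e-12):
    """Split predictive uncertainty into data and model parts.

    probs: array of shape (n_passes, n_classes); each row is the class
    probability vector from one stochastic forward pass or ensemble member.
    Returns (total, data, model) in nats.
    """
    probs = np.asarray(probs, dtype=float)
    mean_probs = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged prediction.
    total = -np.sum(mean_probs * np.log(mean_probs + eps))
    # Data (aleatoric) uncertainty: average entropy of the individual predictions.
    data = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Model (epistemic) uncertainty: the remainder (mutual information).
    model = total - data
    return total, data, model


# All passes agree that the input is ambiguous: mostly data uncertainty.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Passes are individually confident but contradict each other: mostly model uncertainty.
disagree = [[1.0, 0.0], [0.0, 1.0]]

print(decompose_uncertainty(agree))
print(decompose_uncertainty(disagree))
```

In the first case the remainder term is close to zero; in the second it accounts for nearly all of the total.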

Rasmus Christensen

Position: Computational Design of Disordered Electrode Materials for Batteries

Categories: Fellows, PhD Fellows 2022

Location: Aalborg University

Abstract:

Safe and efficient batteries are one of the key technologies for the electrification of transport and sustainable energy storage, and thus for enabling the green transition. The intercalation-type Li-ion battery is by far the most studied and commercially successful battery type. Electrodes in these batteries have traditionally been ordered crystalline materials, but improvements in these materials’ capacity and stability are needed. Recent studies suggest that such improvements can be achieved by using electrode materials with different kinds of disorder, for example materials undergoing order-disorder transitions during charge/discharge cycling.

In this project, I propose to use topological data analysis and machine learning methods to enable the computational design of such disordered electrode materials with improved performance. To this end, I have divided the project into four tasks. First, atomic structures of the selected systems will be generated. This will be done using molecular dynamics simulations as well as based on experimental x-ray/neutron scattering data that are analyzed using reverse Monte Carlo, genetic algorithm, or particle swarm optimization algorithms. Second, topological features of these atomic structures will be identified using topological data analysis. When these data are combined with a classification-based machine learning algorithm, it will be possible to construct topological metrics that are correlated to the materials’ propensity to possess large tunnels that enable Li ion motion. Third, models for predicting the dynamics of the conducting Li ions will be constructed using graph neural networks.

Based on this analysis, the relative importance of the various structural environments surrounding the Li ions on their dynamics can be quantified. Fourth, the insights gained in the previous two tasks will be used to design new improved electrode materials based on high-throughput molecular dynamics simulations and machine learning regression models.

Taken as a whole, the proposed research will enable battery scientists to find “order in disorder” in a promising new family of electrode materials, which in turn will enable future development of novel batteries. Two experts in machine learning applications, disordered materials, and topological data analysis will supervise the project.

Richard Michael

Position: Principled Bayesian Optimization and scientific ML for Protein Engineering

Categories: Fellows, PhD Fellows 2022

Location: University of Copenhagen

Abstract:

Protein Engineering has a wide range of applications from biotechnology to drug discovery. The design of proteins with the intended properties entails a vast discrete search space, and while we have computational representations and experimental observations available, we lack the methods to adequately combine all available information. In this proposed work we use state-of-the-art probabilistic optimization and propose a novel machine learning method: principled Bayesian Optimization on latent representations applied to protein variants.

We utilize abundant cheap experimental observations together with various latent information from deep stochastic models. We optimize the target function on a large dataset of lower-quality experiments with respect to very scarce high-quality experimental candidates through dual-output Gaussian Processes. This method promises to predict the highest-scoring variants given abundant noisy assay data. The goal is to significantly improve predictions of protein variant candidates with respect to intended function. This project is a collaboration between the Bio-ML group at the Department of Computer Science and the Department of Chemistry under joint supervision at the University of Copenhagen.

The successful outcome of the research project would allow us to reduce required experimental time and resources through better computational protein variant proposals. We propose to achieve this by incorporating different data sources and accounting for epistemic and aleatoric noise.

Peter Laszlo Juhasz

Position: Topological Data Analysis Based Models of Evolving Higher-Order Networks

Categories: Fellows, PhD Fellows 2022

Location: Aarhus University

Abstract:

Although the field of complex networks has been actively researched for several decades, higher-order networks describing group interactions have only recently gained special attention. Their richer description of the interacting components comes at a cost: more complex mathematical tools from the field of topological data analysis are needed to study them. Despite numerous studies in this field, the structural dynamics and evolution of higher-order networks are still not well understood.

The goal of the proposed research project is to build topological models to detect and predict the structural dynamics of real-world higher-order networks. Among other things, these models could shed light on the evolution of neural networks such as the human brain, the dynamics of scientific collaborations, or the prediction of group relationships in social networks.

Ida Burchardt Egendal

Position: Mutational Signatures Using Neural Networks for Robust Stratification of Cancer Patients

Categories: Fellows, PhD Fellows 2022

Location: Aalborg University

Abstract:

Somatic mutations play an integral role in the development of cancer. In the past decade the identification of patterns in the somatic mutations, called mutational signatures, has increased in popularity. These signatures are associated with mutagenic processes, such as DNA damage and sun exposure. Although the signatures contain vital information about tumorigenesis, there is a lack of confidence in the signatures, which are estimated predominantly by non-negative matrix factorisation.

We propose an autoencoder alternative to signature extraction which we hypothesize will increase stability and confidence in the signatures. These new signatures will be used to diagnose ovarian cancer patients with homologous recombination deficiency, a DNA deficiency that has been shown to be sensitive to PARP inhibitor treatment. Potentially, this test leads to improved identification of ovarian cancer patients who will respond to platinum treatment, a surrogate treatment for PARP inhibitors, which would indicate that the proposed test could successfully act as a predictive biomarker for PARP inhibitor treatment.

The project will deliver a pipeline for confident stratification of cancers based on mutational signatures, providing one step further towards personalised medicine for DNA repair-deficient tumours.

Paul Jeha

Position: Itô’s Formula as a Learnable Bridge Between Two SDEs

Categories: Fellows, PhD Fellows 2022

Location: Technical University of Denmark
