Meet
The DDSA Fellows
Meet the visionary and creative-thinking PhD and Postdoc Fellows who have received a Fellowship from the Danish Data Science Academy. In the sections below, you can learn more about the fellows and their research projects.
PhD Fellows 2024
Christian Mikkelstrup
Abstract:
I propose a project that combines Deep Learning (DL) and Graph Cuts for more accurate image segmentation methods of medical and biomedical 3D volumes. Current state-of-the-art segmentation research is dominated by Convolutional Neural Networks (CNNs) like U-Net because they have high accuracy and are adaptable to almost any problem. However, they require large annotated 3D datasets and, since their segmentation is based on voxel-wise labeling, do not guarantee known topological features and object shapes. On the other hand, the Graph Cut is a classic segmentation method [16] based on graph theory and maximum a posteriori energy minimization. With graph-cut-based methods, we can model smoothness, surfaces, and object inclusion/exclusion, all without needing training data. It does, however, require spatial initialization and modeling of the image intensities to be used for the graph energies.
Anders Gjølbye Madsen
Abstract:
The proposed PhD project aims to enhance the trustworthiness and interpretability of machine learning models in healthcare by leveraging causal approaches. It addresses the need for models with human-interpretable, causal handles to ensure AI systems are reliable and understandable to professionals. The project proposes to investigate and develop methods to distinguish between predictive and explanatory features in ML models, emphasizing the importance of causality in model interpretability.
Mathilde Diekema
Abstract:
Cell-free DNA (cfDNA) in the bloodstream, which includes circulating tumor DNA (ctDNA) from tumors, is a promising biomarker for cancer detection and monitoring. However, the minute amounts of ctDNA, especially in early-stage cancers or small tumors, make detection difficult, as the signal is often indistinguishable from sequencing errors. This PhD project proposes to introduce deep generative models to the ctDNA field to facilitate analysis and improve detection performance. Building on the DREAMS (Deep Read-level Modelling of Sequencing-errors) method, I will integrate a Deep Generative Decoder (DGD) to model both noise and cancer-related signals. Through latent-space sample representations, the DGD model is hypothesised to better capture the complex, patient-specific ctDNA signals and sample-specific noise patterns in whole genome sequencing (WGS) data. This project is supported by extensive, local WGS cfDNA datasets from colorectal cancer (CRC) patients, bladder cancer (BC) patients and healthy controls, offering a unique opportunity to refine and validate the proposed methods. It encompasses three primary aims: developing a DGD of sample-specific cfDNA error rates, quantifying ctDNA fractions using latent space modelling, and creating a multimodal ctDNA detection framework. These efforts promise to enhance the sensitivity and specificity of ctDNA detection, facilitating earlier cancer detection, improved tumor burden monitoring, and personalised treatment strategies.
Christina Winkler
Abstract:
Scientific simulations in the natural sciences are mathematical and computational models used to represent physical processes and interactions from real-world phenomena such as radiative transfer, cloud formation, or turbulent mixing. Traditional equation-based simulation methods are usually formulated as Partial Differential Equations (PDEs), which are expensive to solve numerically. Recently, deep learning-based methods have been developed to accelerate these simulations, contributing to faster PDE solvers through deep learning-based surrogate models. Since simulation studies aim to predict and generate possible futures, generative models are an attractive candidate. In this research proposal, we propose the use of normalizing flows, a class of invertible generative models that exhibit stable training, offer efficient sampling and inference, and allow for predictive uncertainty quantification together with exact log-likelihood computation on continuous data. We hypothesize that the method will be orders of magnitude faster than traditional time-dependent PDE solvers and will yield gains in computational efficiency and in stability of accuracy over long rollout periods.
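To illustrate what "exact log-likelihood computation" means for an invertible model, the following is a minimal sketch of the simplest possible normalizing flow: a one-dimensional affine map applied to a standard normal base variable. It is not the model proposed in the project; the class name `AffineFlow` and its parameters are assumptions made purely for illustration.

```python
import math


def standard_normal_logpdf(z):
    """Log-density of the standard normal base distribution."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


class AffineFlow:
    """Hypothetical one-layer flow: x = exp(log_scale) * z + shift, with z ~ N(0, 1)."""

    def __init__(self, shift, log_scale):
        self.shift = shift
        self.log_scale = log_scale

    def forward(self, z):
        # Sampling direction: push a base sample through the invertible map.
        return math.exp(self.log_scale) * z + self.shift

    def inverse(self, x):
        # Inference direction: recover the base variable from a data point.
        return (x - self.shift) * math.exp(-self.log_scale)

    def log_prob(self, x):
        # Change of variables:
        # log p_x(x) = log p_z(f^{-1}(x)) + log |d f^{-1}(x) / dx|
        # For this affine map the log-Jacobian term is simply -log_scale.
        z = self.inverse(x)
        return standard_normal_logpdf(z) - self.log_scale


flow = AffineFlow(shift=2.0, log_scale=math.log(3.0))
sample = flow.forward(0.5)             # efficient sampling
log_likelihood = flow.log_prob(sample)  # exact, no approximation needed
```

Practical flows stack many such invertible layers with learned parameters, but the likelihood is still obtained by the same rule: the base log-density plus the sum of the log-Jacobian terms.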
Søren Vad Iversen
Abstract:
The aim of this project is the development of improved sampling algorithms for Bayesian inference. To achieve this, we build upon ideas from earlier work to create an efficient and robust sampling scheme that works well on both high-dimensional and multimodal distributions. As an initial application, the project will develop a Bayesian model selection tool with an integrated goodness-of-fit test that reveals how one might improve the model to describe the data even better. The goodness-of-fit test will build upon an existing test but will use trigonometric functions as an alternative extension to the exponential family used in that work.
The hope is that this leads to new avenues of application for Bayesian inference, thanks to a sampling algorithm that is robust and efficient in a wider range of scenarios, such as the improved Bayesian model selection scheme of this project. This would make the true potential of Bayesian inference available for the problems at hand.
Martin Sæbye Carøe
Abstract:
Computed tomography is a very widespread approach to visualize multi-phase materials in 3D. This project will explore novel mathematical methods that combine several modalities, in particular X-ray CT (XCT) and Neutron CT (NCT) data to improve the resulting reconstruction and subsequent classification of a sample. Currently, reconstruction methods that treat each data modality separately are being used, from which the reconstructions are then combined into a single segmentation. This does not take into account the different nature of the data. In the project, we will study variational regularization methods that integrate prior information and use statistical models for the measurement errors. We will also use a so-called material decomposition method that expands the data in terms of a basis function for each material.
Niklas Gesmar Madsen
Abstract:
Enzymes lie at the centre of a biosustainable and circular bioeconomy, but nature’s catalysts have a recursive complexity to them, often hidden in their dynamics. With the advances of genomics, sequence data is readily used to model and improve enzyme function. Similarly, advances in structural biology and geometric deep learning spark structurally driven enzyme engineering. Yet, the lasting frontier is to understand and design enzyme dynamics (i.e., the temporal dance), which is tightly correlated with enzyme functions (turnover, allostery, selectivity). Designing enzyme function via dynamics is critical for a wide range of enzyme-based applications in biotechnology.
Johanna Düngler
Abstract:
Machine learning models and algorithms are increasingly integrated into critical decision-making processes across various domains such as healthcare, finance, criminal justice, and more. The decisions made by these models can have profound consequences on individuals and society. Because of this, trustworthiness is a critical facet of contemporary Machine Learning research and applications. This project aims to focus on the notions of Robustness, Privacy, and Fairness within ML systems, exploring the trade-offs and dependencies among them.
Kasper Fyhn Borg
Abstract:
The early 2020s have been defined by the concurrent global crises of the pandemic and climate change, characterized by complex interplays of causes, effects, and (real and potential) interventions. Communication about these crises reflects rich causal and counterfactual reasoning over collectively negotiated models of the world. Presently, the argumentative structure of collective discourse can only be studied qualitatively, which imposes limits on the generalizability and scalability of research findings, largely because the task of Causal Relation Extraction (CRE) at scale is underdeveloped in NLP, and non-existent for low-resource languages like Danish.
Causal language is a window into how humans reason causally and counterfactually, a capacity widely held to be the hallmark of human intelligence, and a key topic in research on science and crisis communication, mis/dis-information, and public trust and solidarity. Unlike many computational methodologies, our models and tools will be developed and fine-tuned through application to social scientific research questions. This integrated and research-guided approach ensures that model performance will be evaluated for explainability, interpretability, and robustness by domain-experts at every step. Our open source models and published results will have broad applicability for researchers across disciplines, as well as external stakeholders like policymakers and public health officials.
Sijia Feng
Abstract:
Postdoc Fellows 2024
Martin Rune Hassan Hansen
Samuel Matthiesen
Abstract:
Representation learning aims to transform data to extract relevant information for other tasks. This problem can be understood as being encompassed by generative modelling, which learns a mapping from a latent representation into a data point. As machine learning becomes more pervasive, it is critical to quantify confidence in the model behaviour in fields such as life sciences, security, and general decision making. In this project, we aim to address current limitations of modern approaches to fundamental problems of representation learning and generative modelling. We consider two related lines of research. The first aims to scale Bayesian inference to modern problems of generative modelling, enabling a principled approach to evaluate model behaviour with uncertainty estimates. The second is concerned with the geometry of the latent spaces of those models, allowing us to properly inspect and operate on them. We expect those models to be robust when put in scenarios different from those represented by the training data, and to allow sound analyses of high-dimensional, complex data to be conducted within their latent spaces. Primarily, we consider Gaussian process latent variable models (GP-LVMs) for both lines of research. These models are uniquely able to be employed in similar tasks to modern neural networks and, under certain conditions, have closed-form formulas for the expected metric induced in the latent space, an advantage over neural networks. As a starting point for research on scalability, we consider Laplace approximations that scale linearly with data size. A promising way to bring GP-LVMs to a modern setting could involve a linearised Laplace approximation of an autoencoder, which is based on a neural network, effectively transforming the generative part (decoder) into a GP-LVM. Furthermore, we intend to explore how to make use of the expected metric for GP-LVMs for larger problems. Their closed-form formulas can be inefficient to work with. 
Recent advances in modern automatic differentiation engines are a promising avenue for solving this, as the Jacobian of the model is usually needed for constructing the expected metric tensor. This requires careful reformulation of common operations on Riemannian manifolds. Together, the proposed lines of investigation aim to build scalable, uncertainty-aware generative models whose latent spaces are geometrically well-understood.
Benjamin Skov Kaas-Hansen
BECAUSE-ICU will build the first iteration of a data warehouse with large-scale, real-world data from Danish intensive care units, in an OMOP common data model. Then, we will use the data warehouse to replicate results of previous clinical trials, and predict the results of ongoing or imminent trials.
Madeleine Wyburd
Abstract:
Kazu Ghalamkari
This project develops an advanced tensor factorization that is convex, expressive, and robust. This can potentially lead to a paradigm shift towards solid data analysis to eliminate initial value dependency from downstream tasks of tensor decomposition and provide stable computational tools for the data science communities reliant on tensors.
Emil Michael Pedersen
Cross-Academy Fellows 2024
Malene Nørregaard Nielsen
Abstract:
Aim:
This project investigates the unexplored clinical potentials of photoplethysmography (PPG) as an assessment tool for patients with atrial fibrillation (AF). We aim to investigate (1) the impact of risk factors on PPG, (2) how AF-related hemodynamic changes are reflected in PPG, and (3) how ablation treatment for AF affects hemodynamics.
Background:
PPG is a technique that uses light to detect volumetric changes in the peripheral vasculature. It is widely available in wearables and provides a more continuous signal than electrocardiography (ECG). In research, PPG has been used to detect AF with high accuracy comparable to ECG. PPG is less well characterized than ECG, and it is unknown how ageing, hypertension, diabetes, and other risk factors as well as hemodynamics relating to AF are reflected in the PPG. This project will generate important basic knowledge on the clinical use of PPG and at the same time investigate the hemodynamics of AF, the most common arrhythmia worldwide.
Methods:
We will develop deep neural networks (DNN) for detecting hemodynamical patterns related to AF based on PPG recordings and characteristics from three independent cohorts comprising >6500 patients. Specifically, we will apply a DNN to (1) use PPG signals to distinguish patients with a risk factor (e.g. diabetes) from patients without, (2) investigate how hemodynamics change before, during, and after AF in PPG signals, and (3) distinguish between a patient’s hemodynamical pattern before and after they have received ablation therapy with PPG signals. To allow for linkage between the PPG signal and the outcome, we will specifically develop and apply explainable AI (xAI) methods for PPG analysis. xAI allows for a visual interpretation of the otherwise hidden decision-making of the DNN and graphically depicts the linkage of the signal to the outcome. xAI has previously been used with ECG analysis, and in this project we will develop the method for use with PPG signals for characterisation of hemodynamics associated with the risk factors, paroxysmal AF, and AF management.
Perspectives:
This project will provide a novel understanding of PPG necessary for future clinical use and investigate unknown mechanisms of AF. Firstly, we will characterize the effect of prevalent risk factors on the PPG with huge implications for PPG algorithm development. We will also determine to what degree PPG may be used as a gatekeeper for further diagnostic work-up and reduce the number of unnecessary tests for the benefit of patients and society. Secondly, we will generate important knowledge on AF mechanisms and on how hemodynamics are reflected in PPG signals and our findings will be part of the scientific foundation necessary for the use of PPG in healthcare, whether driven by industry or academia. Finally, this project will help gain mechanistic information on ablation as a treatment for AF and might eventually help inform personalized treatment.
Shanshan He
Diabetic kidney disease (DKD) represents a major long-term complication of Type 2 Diabetes (T2D), increasing the risk of kidney failure and cardiovascular events. Yet, the relationship between T2D and DKD is complex, as it is difficult to accurately predict the degree of kidney damage a T2D patient will develop and whether it will eventually develop into DKD. This is largely driven by a lack of understanding of the precise molecular and cellular mechanisms underlying the association between DKD and T2D. This project aims to deepen our understanding of the development of DKD in T2D on a genetic and cellular level through the application of a state-of-the-art single-cell multimodal sequencing assay, bioinformatics tools, and deep learning models. By simultaneously profiling gene expression and genome-wide chromatin accessibility within the same kidney nuclei, we will construct a comprehensive molecular atlas derived from thirty kidney biopsies representing a spectrum of severity from non-diabetic kidney disease to DKD in T2D patients from the PRIMETIME2 Danish national cohort study.
This atlas will facilitate the generation of cell type-specific gene regulation networks and the integration of regulatory DNA atlases with disease genetic variants obtained from high-powered genome-wide association study datasets. We will use this to calculate kidney cell type-specific polygenic risk scores (PRSs) to stratify large heterogeneous patient groups and validate the predictive power of these cell type-specific PRSs in several large deeply genotyped cohorts.
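As a rough illustration of the idea behind a cell type-specific PRS, the sketch below computes a PRS in its standard additive form, a weighted sum of allele dosages, and then restricts the sum to variants annotated to one cell type. The variant identifiers, effect sizes, dosages, and the annotation set are all invented for the example, and restricting by an annotation set is only one assumed way of making a score cell type-specific; it is not taken from the project description.

```python
def polygenic_risk_score(dosages, effect_sizes, variants=None):
    """Additive PRS: sum over variants of effect size times allele dosage.

    dosages      -- dict mapping variant id to allele dosage (0, 1 or 2)
    effect_sizes -- dict mapping variant id to a GWAS effect size
    variants     -- optional subset of variant ids to include
                    (all variants with an effect size if None)
    """
    if variants is None:
        variants = effect_sizes.keys()
    return sum(
        effect_sizes[v] * dosages.get(v, 0.0)
        for v in variants
        if v in effect_sizes
    )


# Hypothetical inputs, invented for illustration only.
effect_sizes = {"variant_1": 0.25, "variant_2": -0.125, "variant_3": 0.5}
patient_dosages = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

# Assumed annotation: variants falling in regulatory regions that are
# accessible in one (unnamed) kidney cell type.
cell_type_variants = {"variant_1", "variant_3"}

genome_wide_prs = polygenic_risk_score(patient_dosages, effect_sizes)
cell_type_prs = polygenic_risk_score(
    patient_dosages, effect_sizes, variants=cell_type_variants
)
```

Comparing `genome_wide_prs` with `cell_type_prs` across patients is the kind of contrast that stratification by cell type relies on.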
Through this comprehensive analysis, we aim to gain novel insights into the shared genetic, cellular, and molecular basis of DKD and T2D. This understanding will enhance the prediction and precision treatment of DKD by stratifying the heterogeneous T2D patient group.
Jakob Nebeling Hedegaard
Abstract:
Automatic Anomaly Detection in Health Registry Data by Dynamic, Unsupervised Time Series Clustering
Denmark has established a wealth of health registries used to monitor the quality of health care. Although this resource has enormous potential, data has become so complex and high-dimensional that important insights in quality of care and patients’ safety may go unnoticed. There is, thus, a need for a dynamic, automated algorithm capable of flagging growing anomalies in registry data, helping health care personnel to rapidly discover important divergencies.
In this project we will develop and test a new algorithm based on dynamic, unsupervised time series clustering with anomaly detection for health care data. At each time point, the algorithm will cluster patients (using, e.g., hierarchical, t-SNE, or autoencoder clustering) based on a patient trajectory metric (e.g., Hamming distance or optimal matching), and the development of anomaly clusters will be monitored by significant change in a cluster dissimilarity measure (e.g., Jaccard distance or MONIC). The algorithm’s output will consist of summaries of detected anomalies in a form that allows for a quick assessment by relevant health care professionals. These summaries will be evaluated by a team of experts, and the algorithm will be tuned based on their input. The algorithm will thus learn, through supervision, to predict expert interests. The algorithm will be developed and tested on the Danish Diabetes Database (DDiD).
Such an algorithm would greatly improve the health care system’s ability to react in a timely manner to both positive and negative trends in quality of care. Furthermore, the algorithm will be developed in a disease-independent fashion, such that it can be implemented more generally and potentially be used to monitor other areas in critical need of attention.
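The two example measures named above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the project's algorithm: patient trajectories are assumed to be equal-length sequences of categorical states, clusters are assumed to be sets of patient identifiers, the clustering step itself is omitted, and a fixed `threshold` stands in for whatever test of "significant change" the project will use. The function name `flag_changed_clusters` is hypothetical.

```python
def hamming_distance(traj_a, traj_b):
    """Number of time points at which two equal-length trajectories differ."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same length")
    return sum(a != b for a, b in zip(traj_a, traj_b))


def jaccard_distance(set_a, set_b):
    """1 - |intersection| / |union|; defined as 0.0 for two empty sets."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def flag_changed_clusters(previous, current, threshold):
    """Flag current clusters whose closest previous cluster is still far away.

    previous, current -- dicts mapping a cluster name to a set of patient ids
    threshold         -- assumed stand-in for a significance criterion
    """
    flagged = []
    for name, members in current.items():
        closest = min(
            (jaccard_distance(members, old) for old in previous.values()),
            default=1.0,
        )
        if closest > threshold:
            flagged.append(name)
    return flagged


# Invented example: cluster "c1" has changed substantially between time points.
clusters_t0 = {"c0": {1, 2, 3, 4}, "c1": {5, 6}}
clusters_t1 = {"c0": {1, 2, 3}, "c1": {5, 6, 7, 8, 9}}
changed = flag_changed_clusters(clusters_t0, clusters_t1, threshold=0.5)
```

In this toy example only `c1` is flagged, because its membership has changed substantially while `c0` is almost unchanged.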
Manuel Mounir Demetry Thomasen
Abstract:
Background: Diabetes is a complex chronic condition with severe potential complications, which poses a huge burden on people with the condition, their families, and the healthcare sector. Risk assessment tools facilitating early detection of complications are crucial for prevention and progression management. Progression of diabetes and corresponding physiological changes affect several organs involved in the production of voice and speech. Vocal biomarkers are signatures, features, or combinations of features from the audio signal of the voice that are associated with a clinical outcome and can be used to monitor patients, diagnose a condition, or grade the severity of a disease. Vocal biomarkers for diseases affecting the nervous system are well-established, but there is also some evidence for a potential in diabetes and cardiovascular research. Therefore, this project focuses on cardiovascular disease (CVD), neuropathy, and diabetes distress as clinical outcomes. Previous studies have been rather small, so there is also a need to establish new data collection with a focus on diabetes-related complications.
Aims: This interdisciplinary project aims to develop and integrate novel vocal biomarkers in risk assessment of diabetes-related complications. The work will involve (1) data collection, creating new resources for further research in an emerging field, and (2) development of machine learning methods and models that might reveal important clinical knowledge about diabetes-related complications: cardiovascular disease, neuropathy and diabetes distress.
Perspectives: The proposed project will contribute valuable insight into how voice data can be used in risk assessment of diabetes-related complications. The project is expected to generate both methodological results (e.g. pre-trained models, new data sources for machine learning research) and clinically relevant tools (e.g. vocal biomarkers) that might contribute to innovative ways of monitoring diabetes-related complications in the future.
Jasmin Hjerresen
Abstract:
Cardiometabolic diseases including type 2 diabetes (T2D), cardiovascular disease (CVD), and obesity pose a growing global health problem, and a decay in the public metabolic health in Greenland is associated with westernization of diet and lifestyle. The genetic architecture of the modern Greenlandic population is shaped by its demographic history, geographic isolation in an Arctic climate, and small population size, resulting in strong genetic drift and a high frequency of high-impact gene variants. Although genetic variants with high impact on metabolic health have already been described, the genetic regulation of the plasma lipidome and its link to cardiometabolic diseases is poorly understood.
Using state-of-the-art high-throughput mass spectrometry-based lipidomics, we aim to integrate plasma lipidomics data and genetic data from 2,539 Greenlandic individuals to better understand the link between lipid species and metabolic health. A study visit to the Swedish University of Agricultural Sciences for collaboration will provide this project with nuclear magnetic resonance lipidomics analysis, contributing quantitative and qualitative knowledge of the study population and the potential identification of novel compounds. With genome-wide association studies, mapping of lipid quantitative trait loci (lQTLs), and colocalization analyses, we will examine the cross-sectional associations between lipid profiles, registry-based data on cardiometabolic outcomes, and genetic data to identify prognostic biomarkers and investigate biological pathways related to T2D, CVD, and obesity. We expect to see changes in the plasma lipidome in genetic loci linked to cardiometabolic disorders due to genetic drift of the Greenlandic population accompanied by westernized diet and lifestyle.
This project could offer novel insight into the genetic etiology of cardiometabolic diseases to improve our understanding of molecular disease mechanisms and reveal novel targets for disease treatment and prevention in a broader perspective. The discovery of novel high-impact genetic variations associated with altered lipid profiles can contribute to the understanding of metabolic health in Greenland and highlight the implications this research has for genetic precision medicine.
Amalie Koch Andersen
Abstract:
More than 650 million people suffer from prediabetes worldwide and the prevalence is increasing rapidly. A large part of these people will eventually develop microvascular and macrovascular complications generating a large economic burden on society. To prevent or delay onset of these complications, both lifestyle and pharmacological interventions are necessary. However, treatment tools or guidelines specifically for prevention of complications for this group do not exist in general practice. To address this challenge, the project seeks to improve the management of people with prediabetes by developing a decision support system to be implemented at the general practitioner. Based on a prediction of the personalized risk of micro- or macrovascular complications and a risk stratification, individuals with high-risk profiles will be identified. Additionally, different scenarios with lifestyle and pharmacological interventions will be simulated. This novel prediabetes risk engine tool will support informed treatment and early prevention strategies at the general practitioner, aiming to prevent or delay the onset of complications. Data from clinical studies and Danish national registers will be analyzed using data science techniques to identify patterns which are important for prediction of diabetes-related complications. Additionally, Artificial Intelligence including machine learning methodology will be used to develop the prediction model. No studies regarding prediabetes have investigated development and implementation of a flexible model allowing usage with only a limited amount of clinical data, and with a possibility of entering further data to increase precision of the risk estimate. Therefore, this project will focus on development of a flexible predictive model aimed at estimating the personalized risk of micro- or macrovascular complications among individuals with prediabetes.
PhD Fellows 2023
Jakob Lønborg Christensen
Abstract:
Image segmentation is an important research area that has, due to deep learning, seen great advances in recent years. There are still problems to solve, especially when annotated data is scarce. We propose a PhD project aiming to unify agnostic segmentation models with the diffusion process. We argue this is a good idea since many of the ideas in diffusion can be applied to segmentation.
Recent diffusion model developments have been focused largely on the text-to-image domain. Adapting these methods to segmentation can give rise to useful models with human-in-the-loop or few-shot capabilities. The PhD has the potential to be valuable for collaborators of the Visual Computing section at DTU, while also having the potential for larger impacts in the research area as a whole. The applicant, Jakob Lønborg Christensen, is an honours programme student at DTU with multiple peer-reviewed publications. This PhD project would benefit significantly from not being bound to a specific application area or a specific dataset.
Javier Garcia Ciudad
Abstract:
The purpose of this project is to expand our knowledge about the electrophysiological features of sleep, with a particular focus on establishing links and differences between human and mouse sleep in both healthy and narcoleptic phenotypes. Narcolepsy is a sleep disorder characterized by excessive daytime sleepiness. Mouse models are often used to study narcolepsy by introducing specific pathological changes with gene manipulation techniques. Both in humans and mice, sleep and narcolepsy are often studied using electrophysiological signals. Still today, these signals are mainly analyzed by manual annotation of different sleep stages. In recent years, deep learning scoring models have been introduced, though without becoming widely implemented.
These models apply just to humans or just to mice, which is partly motivated by a lack of understanding of how much human and mouse sleep have in common. Finding similarities between both would support the development of common scoring models. More importantly, it would allow causal links to be made between the specific pathological changes modeled in mice and the human disease, which is one of the major challenges in narcolepsy research. In addition, finding electrophysiological signatures of narcolepsy or other factors such as age or gender would enhance our understanding of narcolepsy and sleep.
For this purpose, sleep signals will be studied using state-of-the-art deep learning methods. Sleep scoring models based on transformers and convolutional and recurrent neural networks will be studied to investigate how well they translate between the human and mouse domain. In addition, representation learning using variational autoencoders and contrastive learning techniques will be employed to learn compact representations of sleep signals, with the goal of providing species-invariant representations and identifying individual variabilities from the signals. The learned representations will be projected to lower-dimensional latent spaces, in which the distance between groups can be evaluated. Finally, explainable AI techniques will be investigated to extract insights from the models used, which could reveal EEG biomarkers of species, disease state and other individual variabilities.
Christoffer Sejling
Abstract:
Diabetes and prediabetes are increasingly prevalent conditions in modern society, both of which are associated with numerous hazardous health conditions such as obesity, hypertension, and cardiovascular disease. In itself, type 1 diabetes (T1D) is a life-changing diagnosis, forcing a need for constant health awareness. When dealing with these challenges, a continuous glucose monitor (CGM) is a vital tool that helps patients evaluate their own health and helps inform clinical decision making in a cost-effective manner. Use of CGM devices is therefore becoming more and more common in diabetes clinics around the world, where data from CGMs are collected and analyzed with the objective of optimizing patient care. The increasing adoption of CGMs brings about a huge potential for improving care by developing a data-driven methodology that can be used to assess the CGM data. However, since only simplistic methods based on different summary statistics have been attempted in clinical practice, we still need to uncover the full potential of the information contained in CGM measurements.
In this project, we aim at further developing the statistical methodology for drawing out information from CGM trajectories by making use of complex features such as slope, locality, and temporality. In particular we seek to carry out prediction and statistical inference for clinically relevant outcomes on that basis. Additionally, we aim at estimating causal effects, which may help guide clinical decision making. As outcomes, we consider the occurrence of entering and leaving a state of remission as well as the occurrence of entering a state of hypoglycemia for T1D patients at Steno Diabetes Center Copenhagen. We specifically seek to enhance performance in the prediction of these clinical occurrences and the identification of clinically meaningful attributes by taking advantage of the longitudinal calendar order of the observed CGM trajectories for each patient.
In summary, we aim at obtaining a characterization of CGM trajectory shapes that provides accessible, usable, and valid information, on which clinicians may base their assessments and decisions.
Arman Simonyan
Abstract:
Two-thirds of human hormones act through ~800 G protein-coupled receptors (GPCRs). The vast majority (71%) of these hormones are peptides or proteins, which also account for an increasing share of drugs. The study of peptide-receptor recognition is thus essential for understanding physiology, pathology and for drug design.
This project aims to solve the modeling problem of peptide-receptor recognition by leveraging machine learning methods and unique data from the field hub GPCRdb. I will build predictive graph neural network models representing residue interaction networks across the receptor-peptide interface. The models will utilize attention-based transformer and LSTM architectures which have shown great promise in drug-target interaction prediction and de novo drug design. The models will be trained on a unique data representation, storing data for individual residues rather than the overall protein. This will allow peptide data to be inferred across conserved residues in different receptors, enabling use on receptors not targetable with classical methods.
The trained models will be used in three applied aims to: (1) discover peptide hormones by matching the library of predicted physiological peptide hormones to their cognate receptors with unknown physiological ligand and function; (2) identify peptide probes by matching a pentameric peptide library to understudied and drug target receptors; and (3) holistically engineer probes for those receptors residue-by-residue. The in silico discovered probes will be tested in vitro by pharmacological collaborators. In all, this will let me discover novel hormones and engineer new probes, enabling functional characterization of understudied receptors that cannot be targeted with current techniques.
This project has the potential to uncover mechanisms of peptide-receptor recognition underlying physiological, sensory, and therapeutic responses. This will lay the foundation for exploring uncharted receptor functions and designing better drugs. Given this and that our approach will be applicable
Thomas Gade Koefoed
Abstract:
Insulin resistance (IR) is a key characteristic of type 2 diabetes (T2D) – a common and severe condition characterized by dysregulated blood glucose levels. Despite considerable efforts to map the complex characteristics of IR and T2D, detailed characterizations of IR in some important metabolic tissues, such as skeletal muscle, are still lacking.
In this project, we propose to use a high-throughput, state-of-the-art single-nucleus sequencing assay to gain cutting-edge biological insight into the transcriptomic, epigenetic, and cellular characteristics of IR in skeletal muscle. Furthermore, we will use the generated data to investigate the pivotal role of this tissue in the development of IR and T2D. Specifically, we will determine which muscle cell types mediate the most heritable risk of IR and T2D, potentially elucidating novel targets for treatment. Finally, we will investigate whether cell-type-specific polygenic risk scores can enable better prediction of a patient’s disease comorbidities and drug responses when compared to the use of traditional, non-cell-type-specific polygenic risk scores. No such analysis has yet been performed for human skeletal muscle, and the resulting stratification of heterogeneous IR and T2D patient groups would constitute an important advancement in precision medicine.
The single-nucleus assay will be performed by the Hansen group for Genomic Physiology and Translation in collaboration with the Single-Cell Omics Platform at the Novo Nordisk Foundation Center for Basic Metabolic Research. The full dataset will be generated before the start of the project in Q3 2023, at which point the PhD candidate will start computationally analyzing the data, drawing upon state-of-the-art bioinformatic tools and machine learning models. Importantly, the proposal is based on proof-of-concept data from one skeletal muscle sample, which is included in the project description. Additionally, the project is based on multiple national and international interdisciplinary collaborations, including supervisors from both clinical and technical backgrounds and a six-month research stay at the Broad Institute of Harvard and MIT, Boston, USA.
Finally, it should be noted that the bioinformatic analyses in this project can be generalized to any heritable disease and tissue. We, therefore, believe that the knowledge and methodological advancements gained from the project will have a wider clinical impact beyond skeletal muscle and metabolic diseases.
Jette Steinbach
Abstract:
The ability to predict disease risk and identify individuals at high risk for developing a certain disease is fundamental to modern healthcare, since it enables the implementation of preventive measures and personalized treatments. Polygenic scores (PGS) have received attention for their promise to improve clinical prediction models. Recently, electronic health records (EHR) have also proven to enhance prediction accuracy. However, the accuracy of both PGS and EHR in clinical prediction models is impacted by individual genetic, environmental and diagnostic heterogeneity, which can lead to racial, gender, and ancestry-based biases. It is important to understand and measure the impact and severity of these types of heterogeneities, in order to develop more inclusive, accurate and robust prediction models. These models need to be evaluated and replicated across cohorts and in individuals of different genetic ancestries.
The proposed PhD project intends to address this by evaluating the impact of these heterogeneities on the predictive performance of PGS, EHR and informed family history (FH) within and across cohorts and ancestries. It will do so by studying the effect of genetic and environmental heterogeneity on the prediction accuracy for numerous health outcomes, characterizing differences in EHR across populations, and providing more robust prediction models that incorporate EHR, PGS and FH.
This PhD project aims to contribute with high-quality research to the field of psychiatric epidemiology and psychiatric genetics by providing insight into the predictive accuracy of prediction models across ancestries and cohorts. It intends to provide a deeper knowledge about the impact of genetic and environmental heterogeneity on the predictive performance of PGS, informed FH and EHR, and may serve as a guide for future research on the development of clinical prediction models.
Mikkel Runason Simonsen
Abstract:
For a wide range of medical conditions, prognostic models are used routinely to inform patients about their outlooks, guide treatment choice, and recruit patients into clinical trials. However, many prognostic models are developed and used with knowledge only of the model's discriminatory capacity, and not of its calibration and clinical utility. This PhD program aims to develop a method that can improve the calibration, and thus the clinical utility, of prognostic models, such that they will apply in heterogeneous clinical settings across borders and continents. Additionally, new prognostic models for specific hematological cancers that outperform existing models will be developed.
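The difference between discrimination and calibration can be illustrated with a small Python sketch. It is not the methodology of the project; the outcome data, the predicted risks, and the function names `auc` and `observed_expected_ratio` are hypothetical and chosen only for illustration. The sketch shows two sets of predicted risks that rank patients identically (same discrimination) while one of them systematically underestimates risk (worse calibration).

```python
def auc(y_true, y_prob):
    """Discrimination: chance that a random event case gets a higher
    predicted risk than a random non-event case (ties count one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def observed_expected_ratio(y_true, y_prob):
    """Calibration-in-the-large: observed number of events divided by the
    number of events the model expects. A value of 1 is ideal."""
    return sum(y_true) / sum(y_prob)


# Hypothetical toy data: 1 = event occurred, 0 = no event.
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
risks_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
# Same ranking of patients, but every predicted risk is halved.
risks_b = [p / 2 for p in risks_a]

auc_a, auc_b = auc(outcomes, risks_a), auc(outcomes, risks_b)
oe_a = observed_expected_ratio(outcomes, risks_a)
oe_b = observed_expected_ratio(outcomes, risks_b)

print(f"AUC  A={auc_a:.4f}  B={auc_b:.4f}")
print(f"O/E  A={oe_a:.3f}  B={oe_b:.3f}")
```

Because the AUC depends only on the ordering of the predicted risks, it cannot detect the systematic underestimation in the second set; the observed/expected ratio can. This is why a model reported only with its discriminatory capacity may still perform poorly when transferred to a setting with a different baseline risk.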
The project consists of two elements. Firstly, we will develop a new methodology to improve external validation of prognostic models particularly aiming at improving model calibration. This is of particular interest as new prognostic models developed in a Danish setting may not perform as well in other countries with different clinical standards, background mortality, and culture. Secondly, we will develop new prognostic models within hematological cancers using the newly developed methodology in combination with machine learning and artificial intelligence (AI) approaches. Denmark holds numerous comprehensive clinical registers, which the model development will be based on.
Development of a methodology for improving performance, particularly model calibration, of prognostic models will allow for the development of prognostic models that perform well in a variety of economic, cultural, and clinical settings. Improving the precision of prognostic models will provide health care planners, patients, and clinicians with a better foundation for making important clinical decisions. For instance, accurate prognostic models for hematological cancers can identify high-risk patients more accurately at the time of diagnosis, which can be used to guide treatment or recruit patients for clinical trials. Identification of low-risk patients is also important as these will be candidates for de-escalating treatment, which can avoid severe side effects from the treatment.
Sebastian Loeschke
Abstract:
Efficient and realistic image rendering (IR) has long been a focus of research. Machine learning (ML) techniques for IR have enabled the creation of complex and photorealistic images. Despite recent advances, these techniques are often slow and memory-intensive, limiting their practical use.
This Ph.D. proposal aims to explore the potential of quantum-inspired tensor network (TN) methods for IR tasks, with the goal of reducing memory and computational costs. TNs are versatile and powerful scientific simulation tools that have been successful in simulating strongly correlated quantum many-body systems and quantum circuits. TNs have also been used to compress deep neural networks, leading to significant memory savings in ML applications. However, TNs have not been utilized as extensively as neural networks in ML, and the development of tools and techniques for training them has been limited.
This project will develop novel algorithms and techniques that leverage TNs’ full capabilities in an ML and IR setting to achieve real-time or animated 3D IR at high precision. The project will identify promising TN embeddings for images and scenes, and develop efficient learning algorithms for constructing them. Specific projects include exploring discrete vs. continuous TN embeddings, upsampling methods, and incorporating TNs into normalizing flows and diffusion models to improve representational power and inference time.
This project has the potential to significantly contribute to the fields of ML, IR, quantum computation, and life sciences, which heavily rely on the analysis of large datasets. By developing efficient IR techniques, this project aims to make IR more practical and accessible, benefiting fields such as medical imaging, gaming, virtual and augmented reality, and robotics. Additionally, TN methods have the potential to significantly reduce the carbon footprint of ML applications by developing more efficient algorithms that can process large datasets with fewer computational resources. This will not only benefit the environment but also democratize ML by making it more accessible to a wider range of individuals. In addition, the use of TNs allows for better explainability compared to deep learning models. Lastly, this project will contribute to the collaboration between the quantum and ML communities and also help map out the landscape where TN algorithms provide an advantage, paving the way for future advancements in quantum-native algorithms.
Mikkel Werling
Abstract:
In recent years, artificial intelligence has shown remarkable results in computer vision, natural language processing, and image generation. But in many domains within health, progress in predictive models has stagnated. Algorithms often show (1) low prediction accuracies and (2) poor generalizability beyond training data. Low prediction accuracies are largely the result of ubiquitous low-resource settings in health and an inability to incorporate data from different sources (e.g., different countries and different data modalities). The problem of generalizability is mainly due to algorithms being trained on data from a single site but rarely benchmarked on external data, leading to overfitting and vulnerability to data shifts.
In this project, we address the problem of low prediction accuracies and generalizability in the specific domain of chronic lymphocytic leukemia (CLL), where progress in prognostic models has stagnated.
We increase prediction accuracies by developing a novel meta-learning framework capable of handling multiple data modalities and multiple outcomes. This allows us to include multiple data sources as well as combine information from related diseases (multiple myeloma and lymphoma primarily) (Figure 1A), drastically reducing the number of samples needed for state-of-the-art performance.
We address the problem of generalizability by spearheading an international collaboration across four different countries. By combining federated learning with a model capable of domain adaptation, we overcome the issue of heterogeneity in the data from different countries, thereby producing internationally robust results (Figure 2B). We establish a global benchmark, allowing us to assess the international generalizability of our model.
By providing a proof-of-concept of the value of learning from multiple diseases, we revolutionize how we think about patient data in health. Using CLL as a litmus test, this project will generate a roadmap for overcoming some of the biggest barriers in health machine learning (hML) and achieving state-of-the-art performance even in low-resource domains.
Asbjørn Munk
Abstract:
This research proposal is a collaboration between the Machine Learning Group at University of California, Berkeley, Bruce Fischl’s group at Harvard Medical School, and the Medical Image Analysis Group at University of Copenhagen.
Deep learning has shown tremendous success in brain image analysis, aiding clinicians and researchers in detecting and treating a wide range of diseases from various data sources such as CT and MR images. A common task is to perform segmentation on images. However, the field is fundamentally limited by a lack of labeled training data and substantial variations in the data available. As a result of this, models often exhibit a lack of robustness to changes in equipment, patient cohorts, hospitals, and scanning protocols.
Because segmentation models have very large hypothesis spaces, existing domain adaptation theory and methodologies have failed to alleviate these problems. Current methodologies are either not grounded in theory or impractical to apply to segmentation models, where the large hypothesis spaces make it intrinsically difficult to overcome numerical issues or achieve noteworthy performance.
To push forward the field of brain image analysis, there is a need for theoretically well-founded domain adaptation methods. This project aims to work towards such methods by conducting theoretical work with the world-leading machine learning group at Berkeley and applying this work to brain image segmentation in collaboration with Bruce Fischl’s world-leading group at Harvard Medical School. The project is to be centered at the Medical Image Analysis group at University of Copenhagen, which is internationally recognized for applying cutting-edge machine learning practices to medical image analysis problems.
If successful, this project will lead to more robust models, a fundamental contribution towards utilizing the vast number of medical images being produced at hospitals worldwide. This work will contribute towards providing technology for improving fundamental brain research as well as.
Postdoc Fellows 2023
Laura Helene Rasmussen
Abstract:
Arctic winter climate is rapidly changing, with more variable snow depths and spring snowmelt timing, and more frequent midwinter thaw events. Less predictable conditions disrupt ecosystem balances and development in Arctic communities, and understanding winter variability across the Arctic and its influence on the climate throughout the year is needed to mitigate the consequences of changing winters. However, access to in situ measured data has been extremely limited, and the data are scattered across local databases. Hence, cross-Arctic winter studies are few and based on remotely sensed data with larger spatial and temporal coverage but less local sensitivity, and the winter contribution to annual average temperature change has not been investigated across the Arctic.
In this project, we 1) obtain, clean and standardize in situ soil surface temperature, snow depth and soil moisture data from climate monitoring programs across the Arctic and create a unique database with cross-Arctic in situ winter climate data from approximately the last 30 years. We use this dataset to 2a) estimate the accuracy of remotely sensed soil surface temperature, snow depth and soil moisture data using the regression model with the best fit, and quantify the bias, for each major Arctic region. We further 2b) construct an open access Winter Variability Index (WVI) for each major Arctic region based on the winter phenomena (average snow depth, snowmelt date, frequency of winter thaw events) that are the most important drivers of a clustering analysis such as PCA, hierarchical clustering or autoencoders. Finally, we 3) use the change in WVI and in annual mean temperatures for each decade in a function-on-function regression analysis, which will quantify the contribution of winter variability change to annual average temperature changes in each Arctic region.
The project will produce a comprehensive dataset with potential for further research and will improve our region-specific understanding of remotely sensed data accuracy, which is key for confidence in climate system modelling. The WVI allows scientists or local communities to classify Arctic winter data within a quantitative framework of pan-Arctic winter variability, also in the future, and to understand how important changes in winter variability are for Arctic climate changes throughout the year.
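Step 2a can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temperature values are invented, a straight-line fit stands in for "the regression model with the best fit" (which the project would select from the data), and the helper names `mean_bias` and `fit_line` are hypothetical.

```python
def mean_bias(in_situ, remote):
    """Average of (remotely sensed - in situ); positive means the
    remote product reads too high."""
    return sum(r - s for s, r in zip(in_situ, remote)) / len(in_situ)


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical paired winter soil surface temperatures (degrees Celsius)
# for one region: in situ measurements and the remotely sensed product.
in_situ = [-12.0, -8.0, -4.0, 0.0]
remote = [-10.0, -6.0, -2.0, 2.0]

bias = mean_bias(in_situ, remote)
slope, intercept = fit_line(in_situ, remote)

print(f"mean bias: {bias:+.1f} degrees C")
print(f"remote = {intercept:.1f} + {slope:.1f} * in_situ")
```

In this toy example the remote product is uniformly 2 degrees too warm, so the fitted slope is 1 and the intercept equals the mean bias. With real data the slope would also indicate whether the error depends on the temperature itself, and the same calculation would be repeated per region and per variable.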
Beatriz Quintanilla Casas
Abstract:
Today’s design and production of food products are still based on human artisan skills, especially when it comes to high-quality products where blending of raw materials is key. The development of new data science tools plays a key role in this food transition, as such tools make it possible to exploit current knowledge comprehensively while uncovering new connections. Therefore, the proposed project, named EXPLOGA – Exploratory Gastronomy, seeks to improve food design and production practices, making them more efficient and sustainable, by developing new scientific data tools.
These new tools will be able to convert food flavour measurements into chemically and gastronomically well-defined information, through automated untargeted profiling of flavour data as well as advanced text analysis of the existing flavour information. EXPLOGA represents the first level of a new field of research that we call the functional gastronomy approach, which aims to use data science to better understand the influence of raw materials and processing techniques on the final food products in a broad sense.
This project will be carried out at the Chemometrics group at the Department of Food Science (University of Copenhagen), supervised by Prof. Rasmus Bro. It will also include a three-month international stay at the Norwegian Food Research Institute (NOFIMA).
Dustin Wright
Abstract:
Science reporting is not an easy task due to the discrepancy between scientific jargon and lay terms, as well as a discrepancy between the language of scientific papers and associated news articles. As such, not all scientific communication accurately conveys the original information, which is exemplified by skewed reporting of less technical topics and unfaithful reporting of scientific findings. To compound this problem, the average amount of time journalists can spend on individual articles has decreased due to funding cuts, lack of space, and increased commercialization. At the same time, the public relies on the media to learn about new scientific findings, and media portrayal of science affects people’s trust in science while at the same time influencing their future actions [7,26,27].
My project proposes to develop natural language processing (NLP) tools to support journalists in faithfully reporting on scientific findings, namely, tools for extracting key findings from scientific articles, translating scientific jargon into lay language, and generating summaries of scientific articles in multiple languages while avoiding distortions of scientific findings.
In two recent studies which I led [20,21], we investigated automatically detecting exaggeration in health science press releases as well as general information change between science reporting and scientific papers, and found that large pre-trained language models can be successfully exploited for these tasks. This project will leverage my previous research and will be much more ambitious, focusing on: 1) detecting distortions between news articles and scientific articles in different languages and across multiple areas of science; 2) using a model which can detect such distortions to automatically generate more faithful news articles; 3) analyzing texts in the difficult domains of medicine, biology, psychology, and computer science research, which I have worked with previously and which garner some of the most media attention. This will result in trained models which can be used as writing assistants for journalists, helping to improve the quality of scientific reporting and information available to the public. In addition, the project will involve international collaboration with the University of Michigan, including a research stay in order to leverage their expertise and resources, as well as develop my competencies as a researcher.
Ignacio Peis Aznarte
Abstract:
Inference-friendly deep generative models such as Variational Autoencoders have shown great success in modelling incomplete data. These models typically infer posteriors from the observed features and decode the latent variables to impute the missing features. Recent deep generative models are well suited for modelling structured data like images, sequences, or vector-valued numbers, and they use neural architectures specifically tailored to the data type. Unfortunately, using these networks for grid-type data necessitates pre-imputation methods, such as zero-filling missing patches, leading to biased inference.
In contrast, Implicit Neural Representations (INRs) model complex functions that map coordinates to features in a point-wise setting using feedforward neural networks, independently of the data type and structure. As a consequence, they infer knowledge only from observed points, thus overcoming the aforementioned bias. Although Markov Chain Monte Carlo (MCMC) methods have been widely used to improve inference in classical deep generative models of structured data, their effectiveness in models of INRs is still an open research question.
My proposed project aims to revolutionize deep generative modelling by leveraging the power of Implicit Neural Representations (INRs) to model incomplete data without introducing any bias. By i) creating novel deep generative models of INRs, and ii) proposing novel MCMC-based inference methods for these models, we can overcome the limitations of existing techniques and open new directions for using MCMC-based inference in generative models of INRs. These groundbreaking contributions have the potential to transform the field of deep generative modelling and have significant implications for how they handle missing data.
Luigi Gresele
Abstract:
Representation learning and causality are fundamental research areas in machine learning and artificial intelligence. Identifiability is a critical concept in both fields: it determines whether underlying factors of variation can be uniquely reconstructed from data in representation learning, and specifies the conditions for answering causal queries unambiguously in causal inference. Causal Representation Learning (CRL) combines these two fields to seek latent representations that support causal reasoning. Recent theoretical advances in CRL have focused on the identifiability of a ground truth causal model. In this research proposal, I present two projects aimed at investigating previously unexplored aspects of CRL.
The first project aims to challenge the assumption of a unique ground truth causal model, by acknowledging that the same causal system can be described using different variables or levels of abstraction. To address this, we plan to investigate novel notions of identifiability, where the true model is reconstructed up to classes of causal abstractions consistently describing the system at different resolutions. We will also search for conditions under which these models can be learned based on available measurements. By doing so, we aim to clarify the conceptual foundations of CRL and inspire the development of new algorithms.
The second project aims to investigate latent causal modelling in targeted experiments, exploiting the rich experimental data and partial knowledge available in scientific domains to refine the CRL problem. Specifically, we will focus on neuroimaging experiments with treatment and control groups, with the objective of isolating the impact of covariates specific to the treatment group on functional brain data, disentangling it from responses elicited by the experimental protocol, which are shared across both groups. An additional difficulty stems from the variability in the coordinatizations of brain functional activities across different subjects due to anatomical differences. We plan to extend our previous work on group studies in neuroimaging to address these challenges. The outcome of this project could have a significant impact on scientific applications of machine learning, also beyond cognitive neuroscience.
In summary, my proposed research projects have the potential to advance the state-of-the-art in Causal Representation Learning, clarifying its conceptual foundations and enabling its application to real-world problems.
Daniel Murnane
Abstract:
The search for new physics beyond the Standard Model at the Large Hadron Collider (LHC) at CERN has been an elusive quest, despite the billion-euro machinery and extremely sensitive detectors used in the experiment. To overcome this obstacle, I propose a project to develop a novel machine learning (ML) approach called a Physics Language Model (PLM).
The PLM is a graph neural network (GNN) that maintains multiple scales of information about the energy deposits across the ATLAS detector located at the LHC. Instead of discarding fine details as is currently done, the PLM uses a hierarchical structure to pay attention to the most relevant scales and features of the physics data. This approach can also be trained on a variety of physics tasks and, in other domains such as protein property prediction, has been shown to outperform single-task models. Novel developments in the field of high energy physics (HEP) should be expected to feed back to improve Biological and Chemical Language Models.
The current HEP paradigm is to work on a discrete task in the physics analysis chain, using only the scale and granularity of the data produced in the previous stage. Modern ML models, and large language models (LLMs) such as GPT in particular, are a complete inversion of this paradigm. They instead gain expressivity from learning emergent patterns in the fine details of many datasets and tasks. In my role as Machine Learning Forum Convener for ATLAS, and with current collaborations with Berkeley Lab, DeepMind, Columbia University, Copenhagen University and Georgia Tech on this topic, I believe the time has come to use the available data, physics tasks, and huge compute availability to build a prototype PLM.
The PLM could greatly increase the discovery power for new physics at the LHC by reviving the data that is currently discarded. This is a unique opportunity, as algorithm choices for the High Luminosity LHC (HL-LHC) upgrade will be finalized within 18 months. If trends in natural language ML can be captured in physics, a PLM can also be expected to grow exponentially in power with increasing dataset and model size.
PhD Fellows 2022
Paul Jeha
Abstract:
Although the field of complex networks has been actively researched for several decades, higher-order networks describing group interactions have just recently gained special attention.
At the expense of the richer description of the interacting components, more complex mathematical tools taken from the field of topological data analysis need to be applied for their study. Despite numerous studies in this field, the structural dynamics and evolution of higher-order networks are still not well understood as of today.
The goal of the proposed research project is to build topological models to detect and predict the structural dynamics of real-world higher-order networks. Among others, these models could shed light on the evolution of neural networks such as the human brain, the dynamics of scientific collaborations, or the prediction of group relationships in social networks.
Fabian Martin Mager
Abstract:
Our brain is the most central part of the nervous system and crucial to our health and wellbeing. In Denmark, mental disorders make up 25% of the total disease burden, and their prevalence is still increasing. Across many medical domains, magnetic resonance imaging (MRI) is a widely used tool to study the anatomy and physiology of soft tissue, e.g. the brain, and is applied in diagnosis and monitoring of diseases, as well as a tool to investigate their underlying mechanisms. In psychiatry, research has yielded substantial evidence for structural brain changes at a group level; however, these are typically subtle, and currently there is no clinical benefit from MRI for the individual patient. Previous research aiming to identify brain aberrations in patients with neuropsychiatric disorders struggles with relatively small and often inhomogeneous samples paired with complex clinical traits and weak pathological signals.
To unravel the intricacy of mental disorders and the brain, one approach is to apply powerful state-of-the-art machine learning algorithms, such as deep neural networks (DNNs). Large DNNs are able to extract high-level features of images and other signals and to solve sophisticated tasks, outperforming traditional machine learning methods by far. On the downside, a DNN requires a large amount of ‘labelled data’, e.g., where each brain image has a meaningful annotation, such as ‘patient’ or ‘control’. The amount of labelled data needed to train such a model is currently not available in conventional psychiatric research. In contrast, ‘unlabelled data’, e.g. normative brain images independent of a certain class or group, are often generously and publicly available. In the field of machine learning, the scarcity of labelled data and richness of unlabelled data have given rise to self-supervised learning paradigms. In self-supervised learning, one exploits rich unlabelled data to learn a general intermediate representation of the matter of interest. The scarce labelled data is then used efficiently to fine-tune the intermediate representation to a specific task of interest.
The aim of this project is to develop a self-supervised DNN model of the brain using MRI data of large international, high-quality databases. This model will then be fine-tuned using highly specific data from psychiatry to address specific research questions regarding the mechanisms of aberrant neurodevelopment. We believe a self-supervised model is more robust and able to learn more meaningful features compared to conventional models. To explore these features and their relation to clinical traits present in psychiatric patients, we want to employ explainable artificial intelligence techniques.
To sum up, we want to use self-supervised learning paradigms and their efficient use of scarce labelled data to develop a state-of-the-art DNN model of brain images, bringing neuropsychiatric research to the forefront of machine learning research.
Note: Since the video was recorded, Fabian has chosen to adjust the scope and title of his research project. The title in the written article is current and correct.
Emilie Wedenborg
Abstract:
Real World Data (RWD), such as electronic medical records, national health registries, and insurance claims data provide vast amounts of high granularity heterogeneous data. An international standard (OMOP) has been developed for health data and accelerating evidence generation from RWD. EU has recently adopted the same standard for the European Health Data & Evidence Network (EHDEN), the largest federated health data network covering more than 500 million patient records. This allows standardization of datasets across institutions in 26 different countries, but a major data science challenge remains on how to tackle the volume and complexity of multimodal data of such magnitude.
The aim is to develop easily human interpretable tools to analyse RWD to extract distinct characteristics enabling new discoveries. The project includes a key industrial collaborator, H. Lundbeck A/S, that will provide additional guidance, contacts, and access to large sets of RWD in the OMOP format.
The project will focus on a prominent data science methodology called Archetypal Analysis, which identifies distinct characteristics (archetypes) and describes each observation in terms of these archetypes, thereby defining polytopes in high-dimensional data. This project will develop tools for uncovering such polytopes in large, high-dimensional, heterogeneous, noisy, and incomplete data. We will develop Bayesian modeling approaches for uncertainty and complexity characterization, data fusion for enhanced inference, and deep learning methods to uncover disentangled polytopes.
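For readers unfamiliar with the method, the classical least-squares formulation of Archetypal Analysis can be written as below. This is a sketch of the standard textbook objective, not necessarily the model the project will develop; the symbols X (data matrix with N observations), C, S and K (number of archetypes) are notation introduced here for illustration.

```latex
% Classical archetypal analysis (notation assumed for illustration):
% X in R^{N x D} holds N observations, K is the number of archetypes.
\begin{aligned}
\min_{S,\,C}\quad & \lVert X - S\,C\,X \rVert_F^2 \\
\text{subject to}\quad
 & c_{kn} \ge 0,\quad \sum_{n=1}^{N} c_{kn} = 1 \quad (k = 1,\dots,K), \\
 & s_{nk} \ge 0,\quad \sum_{k=1}^{K} s_{nk} = 1 \quad (n = 1,\dots,N).
\end{aligned}
```

Each row of CX is an archetype, built as a convex combination of observations, and each row of S gives one observation's weights over the archetypes. Because the weights lie on a simplex, the reconstructions SCX fall inside the polytope spanned by the archetypes.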
The tool will advance our understanding of RWD and will accelerate real-world evidence generation through the identification of patterns in terms of archetypes. Furthermore, trade-offs between archetypes can fuel personalized medicine by defining a profile of the individual patient in terms of a softly assigned spectrum between archetypes. We hypothesize that this characterization will be of important use in advancing our understanding of subtypes and comorbidities within different neurological and psychiatric disorders.
Gala Humblot-Renaux
Abstract:
Human perception is inherently uncertainty-aware (we naturally adapt our decisions based on how confident we are in our own understanding) and multimodal (we seldom rely on a single source of information). Analogously, we argue that trustworthy computer vision systems should (at the very least) (1) express an appropriate level of uncertainty, such that we can reliably identify and understand their mistakes and (2) leverage multiple complementary sources of information, in order to be sufficiently well-informed. While state-of-the-art deep neural networks (DNNs) hold great potential across a wide range of image understanding problems, they offer little to no performance guarantees at run-time when fed data which deviates from their training distribution.
Reliably quantifying their predictive uncertainty in complex multimodal computer vision tasks remains an open research problem, yet will be a necessity for widespread adoption in safety-critical applications. The aim of this project is therefore to conduct basic research in probabilistic deep learning and computer vision, investigating how uncertainty can be modelled and extracted in multimodal DNNs for image classification and segmentation.
We will adopt approximate Bayesian inference methods to separately capture data uncertainty and model uncertainty not only at the final prediction stage, but also in the intermediate feature fusion process, in order to adaptively weigh the contribution of each modality. We will develop novel uncertainty-aware deep fusion methods, and study them on real-world computer vision tasks across a broad range of high-stakes domains including multimodal medical image analysis. Our research will be an important step towards improving the transparency and robustness of modern neural networks and fulfilling their potential as safe, trustworthy decision-support tools.
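As an illustration of what separating data uncertainty from model uncertainty can look like in practice, the sketch below uses one common entropy-based decomposition over several stochastic forward passes (for example from an ensemble or Monte Carlo dropout). It is a minimal sketch under that assumption, not the method the project will develop; the function names and the toy probability arrays are hypothetical.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) of categorical distributions along `axis`."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=axis)


def decompose_uncertainty(probs):
    """Split predictive uncertainty for one input into data and model parts.

    probs: array of shape (n_passes, n_classes), one softmax vector per
    stochastic forward pass (assumed to come from an approximate posterior).
    Returns (total, data, model), where total = data + model.
    """
    probs = np.asarray(probs, dtype=float)
    total = entropy(probs.mean(axis=0))   # entropy of the averaged prediction
    data = entropy(probs).mean()          # average entropy of each pass
    model = total - data                  # disagreement between passes
    return total, data, model


# Hypothetical toy inputs: passes that agree on an ambiguous prediction,
# and passes that are individually confident but contradict each other.
agree = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
disagree = np.array([[0.99, 0.01], [0.01, 0.99]])

total_a, data_a, model_a = decompose_uncertainty(agree)
total_d, data_d, model_d = decompose_uncertainty(disagree)
```

In the first toy case all uncertainty is attributed to the data, while in the second most of it is attributed to the model, since the passes disagree with each other.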
Peter Laszlo Juhasz
Abstract:
Although the field of complex networks has been actively researched for several decades, higher-order networks describing group interactions have only recently gained special attention. Their richer description of the interacting components comes at a cost: more complex mathematical tools taken from the field of topological data analysis need to be applied for their study. Despite numerous studies in this field, the structural dynamics and evolution of higher-order networks are still not well understood.
The goal of the proposed research project is to build topological models to detect and predict the structural dynamics of real-world higher-order networks. Among other applications, these models could shed light on the evolution of neural networks such as the human brain, the dynamics of scientific collaborations, or the prediction of group relationships in social networks.
Richard Michael
Abstract:
Protein engineering has a wide range of applications from biotechnology to drug discovery. The design of proteins with the intended properties entails a vast discrete search space, and while we have computational representations and experimental observations available, we lack the methods to adequately combine all available information. In this proposed work we use state-of-the-art probabilistic optimization and propose a novel machine learning method: principled Bayesian Optimization on latent representations applied to protein variants.
We utilize abundant cheap experimental observations together with various latent information from deep stochastic models. We optimize the target function on a large dataset of lower-quality experiments with respect to very scarce high-quality experimental candidates through dual-output Gaussian Processes. This method promises to predict the highest-scoring variants given abundant noisy assay data. The goal is to significantly improve predictions of protein variant candidates with respect to intended function. This project is a collaboration between the Bio-ML group at the Department of Computer Science and the Department of Chemistry under joint supervision at the University of Copenhagen.
The successful outcome of the research project would allow us to reduce required experimental time and resources through better computational protein variant proposals. We propose to achieve this by incorporating different data sources and accounting for epistemic and aleatoric noise.
Amalie Pauli Brogaard
Abstract:
Misinformation and propaganda are recognised as a major threat to people’s judgement and informed decision making in health, politics and news consumption. The spread of misinformation relating to the COVID-19 epidemic is just one prominent example. It is not only wrong facts that constitute a threat, but also the language used, which can deceive and mislead people. To address the misinformation threat and empower readers confronted with enormous amounts of information, we propose a new data science methodology for the computational analysis of rhetoric in text.
While rhetoric, the art of persuasion, is an ancient discipline, its computational analysis, regarding persuasion techniques, is still in its infancy. We propose a data science project on computational modelling and automatic detection of persuasion techniques at the intersection of Natural Language Processing (NLP) and Machine Learning. We posit that detecting and highlighting persuasion techniques enables critical reading of a text, thereby reducing the impact of manipulative and disingenuous content.
Knowing and understanding fallacies and rhetorical tricks may also help to make stronger, valid arguments in a variety of texts. Moreover, we expect rhetorical information to be beneficial to other semantic language processing tasks and we, therefore, devise approaches to capture and transfer rhetorical knowledge to models for such tasks.
This project will contribute novel models for detecting persuasion techniques, as well as new transfer learning methods for utilising rhetorical knowledge to benefit other semantic text analysis methods.
Rasmus Christensen
Abstract:
Safe and efficient batteries are one of the key technologies for the electrification of transport and sustainable energy storage, and thus for enabling the green transition. The intercalation-type Li-ion battery is by far the most studied and commercially successful battery type. Electrodes in these batteries have traditionally been ordered crystalline materials, but improvements in these materials’ capacity and stability are needed. Recent studies suggest that such improvements can be achieved by the use of electrode materials with different kinds of disorder, for example materials undergoing order-disorder transitions during charge/discharge cycling.
In this project, I propose to use topological data analysis and machine learning methods to enable the computational design of such disordered electrode materials with improved performance. To this end, I have divided the project into four tasks. First, atomic structures of the selected systems will be generated. This will be done using molecular dynamics simulations as well as based on experimental X-ray/neutron scattering data that are analyzed using reverse Monte Carlo, genetic algorithm, or particle swarm optimization algorithms. Second, topological features of these atomic structures will be identified using topological data analysis. When these data are combined with a classification-based machine learning algorithm, it will be possible to construct topological metrics that are correlated to the materials’ propensity to possess large tunnels that enable Li-ion motion. Third, models for predicting the dynamics of the conducting Li ions will be constructed using graph neural networks. Based on this analysis, the relative importance of the various structural environments surrounding the Li ions on their dynamics can be quantified.
Fourth, the insights gained in the previous two tasks will be used to design new improved electrode materials based on high-throughput molecular dynamics simulations and machine learning regression models.
Taken as a whole, the proposed research will enable battery scientists to find “order in disorder” in a promising new family of electrode materials, which in turn will enable future development of novel batteries. Two experts in machine learning applications, disordered materials, and topological data analysis will supervise the project.
Ida Burchardt Egendal
Abstract:
Somatic mutations play an integral role in the development of cancer. In the past decade the identification of patterns in the somatic mutations, called mutational signatures, has increased in popularity. These signatures are associated with mutagenic processes, such as DNA damage and sun exposure. Although the signatures contain vital information about tumorigenesis, there is a lack of confidence in the signatures, which are estimated predominantly by non-negative matrix factorisation.
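To make the conventional baseline concrete, the sketch below factorises a mutation count matrix with plain non-negative matrix factorisation using multiplicative updates for the squared-error loss. It illustrates the existing NMF approach mentioned above, not the autoencoder proposed in this project; the function name `nmf`, the six mutation channels and the synthetic counts are assumptions made for illustration.

```python
import numpy as np


def nmf(V, rank, n_iter=500, eps=1e-10, seed=0):
    """Factorise a non-negative matrix V (channels x samples) as W @ H.

    Uses multiplicative updates for the Frobenius (squared-error) loss.
    Columns of W play the role of signatures, rows of H of exposures.
    """
    rng = np.random.default_rng(seed)
    n_channels, n_samples = V.shape
    W = rng.random((n_channels, rank)) + eps
    H = rng.random((rank, n_samples)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Rescale so every signature (column of W) sums to one;
    # the exposures absorb the scale, leaving W @ H unchanged.
    scale = W.sum(axis=0, keepdims=True)
    return W / scale, H * scale.T


# Synthetic toy data: 2 hypothetical signatures over 6 mutation channels,
# mixed with random exposures across 20 samples.
rng = np.random.default_rng(1)
true_W = rng.dirichlet(np.ones(6), size=2).T                # shape (6, 2)
true_H = rng.integers(10, 200, size=(2, 20)).astype(float)  # shape (2, 20)
V = true_W @ true_H

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because the result depends on the random initialisation and the chosen rank, repeated runs can yield different signatures, which is one way to see the stability concern raised above.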
We propose an autoencoder alternative to signature extraction which we hypothesize will increase stability and confidence in the signatures. These new signatures will be used to diagnose ovarian cancer patients with homologous recombination deficiency, a DNA deficiency that has been shown to be sensitive to PARP inhibitor treatment. Potentially, this test leads to improved identification of ovarian cancer patients who will respond to platinum treatment, a surrogate treatment for PARP inhibitors, which would indicate that the proposed test could successfully act as a predictive biomarker for PARP inhibitor treatment.
The project will deliver a pipeline for confident stratification of cancers based on mutational signatures, providing one step further towards personalised medicine for DNA repair-deficient tumours.