Through alternative splicing, a single gene can give rise to multiple protein isoforms with distinct functions. Isoforms can be specific to diseases, tissues, and cell types, and therefore offer valuable insight into many biological processes. Predicting the function of isoform sequences is thus an essential task in bioinformatics. Unfortunately, traditional methods rely on poorly performing tools, integrating diverse data sources remains challenging, and important features such as intrinsically disordered regions are frequently overlooked.
To address these problems, we propose a universal sequence annotation framework based on protein Large Language Models (LLMs). The approach detects and annotates functional regions of proteins through a joint analysis of attention scores and embeddings. Our proof-of-concept analysis demonstrates that this method successfully recovers known protein domains. Importantly, it also identifies regions that were annotated by other tools, highlighting the generalizability of the approach. While protein sequences provide direct insight into the function of protein isoforms, the underlying regulation is largely determined at the DNA level through regulatory elements such as splicing enhancers and silencers. By extending our framework to DNA sequences, we aim to capture the regulatory mechanisms that drive isoform diversity and expression. Furthermore, we will incorporate other isoform annotation databases to build a unified annotation tool for both known and novel isoforms.
We anticipate that this project will improve the accuracy and completeness of functional element annotation, providing a more reliable resource for studying the role of protein isoforms in health and disease. By integrating diverse data sources into an adaptable framework, our tool can support both basic research and the development of isoform-targeted therapies. Lastly, we believe our use of LLMs will inspire other researchers to apply these models in new ways.