As Large Language Models (LLMs) are deployed in higher-stakes scenarios and gain the ability to take real-world actions, ensuring trust in their decision-making and in the automation they provide becomes imperative. Trustworthiness comprises two components: trust in the capability of others and trust in their intentions [trusteval, trust]. Building on my previous research on the first component, this project will investigate the second by assessing the helpfulness and honesty of LLMs’ ‘intentions.’ Through advances in interpretability research, we aim to better determine when to trust LLMs and to detect instances where they may be misaligned with human values. Because input-output test cases cannot provide comprehensive guarantees, given the infinite variability of language, this project will focus on devising methodologies that use the model’s internal representations, weights, and activations.
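As a minimal illustration of what working with internal representations means in practice, the sketch below reads per-layer hidden activations from an open model via the Hugging Face transformers library. The model name ("gpt2"), the inspected layer, and the prompt are placeholder assumptions for exposition, not choices made by the project.

```python
# Minimal sketch: inspecting a model's internal activations instead of
# relying only on input-output behaviour. Model, layer, and prompt are
# illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "How do I build trust in an automated assistant?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape [batch, seq_len, hidden_dim]; index 0 is the embedding output.
hidden_states = outputs.hidden_states
layer_idx = 6  # placeholder intermediate layer to inspect
last_token_repr = hidden_states[layer_idx][0, -1]  # representation of the final token
print(last_token_repr.shape)  # e.g. torch.Size([768]) for gpt2
```

Such activation vectors are the raw material on which representation-level trust and misalignment analyses would operate.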
The first part of the project investigates the process by which LLMs are trained to answer and act in accordance with predefined values, the so-called alignment stage. This alignment includes, for instance, ensuring that models avoid providing guidance on dangerous activities that could pose significant risks to individuals or society. We will investigate how to measure the reliability of this alignment before a model is deployed and how to detect misalignment once it is in use, thereby providing a general theory of alignment.

The second part of the project focuses on understanding the underlying processes by which LLMs solve tasks. Specifically, we will build on the finding that certain intermediate representations in the model encode abstract functions for solving tasks, as illustrated in the sketch below. By studying these function representations, we aim to develop methodologies for interpreting how the model processes and solves tasks, and thus for understanding the ‘intention’ with which an answer is generated.

The outcomes of this project will have significant implications for the safe deployment of LLMs, offering methodologies to measure trustworthiness, detect misalignment, and interpret model behavior in real-world scenarios. By advancing our understanding of LLM alignment and cognition, this research will contribute to the development of safer, more transparent, and more reliable AI systems.
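To make the notion of a function representation concrete, the sketch below averages an intermediate activation over few-shot demonstrations of a toy task (antonyms) and patches it into a zero-shot prompt, in the spirit of task- and function-vector work. The model, layer, task, and patching strategy are all placeholder assumptions, not the project’s actual methodology.

```python
# Minimal sketch: estimate a "function vector" from few-shot demonstrations
# and add it to the residual stream of a zero-shot run. All choices here
# (model, layer, task) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()
layer_idx = 6  # placeholder intermediate layer

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the embedding output, so index layer_idx + 1
    # corresponds to the output of transformer block layer_idx.
    return out.hidden_states[layer_idx + 1][0, -1]

# Few-shot demonstrations of a toy task (antonyms), used to estimate a
# vector that summarises the task being performed.
demos = [
    "hot -> cold\nbig -> small\nfast ->",
    "up -> down\nwet -> dry\nlight ->",
]
function_vector = torch.stack([last_token_state(p) for p in demos]).mean(dim=0)

# Patch the vector into a zero-shot prompt by adding it to the block output
# at the same layer during the forward pass.
def add_function_vector(module, inputs, output):
    hidden = output[0]  # [batch, seq_len, hidden_dim]
    hidden[:, -1, :] = hidden[:, -1, :] + function_vector
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_function_vector)
try:
    zero_shot = tokenizer("good ->", return_tensors="pt")
    with torch.no_grad():
        logits = model(**zero_shot).logits
    next_token = tokenizer.decode(logits[0, -1].argmax().item())
    print(next_token)  # hoped-for behaviour: an antonym such as " bad"
finally:
    handle.remove()
```

Whether such a patched vector actually steers the model is an empirical question; the point of the sketch is only to show the kind of intermediate object the second part of the project would study and interpret.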