As data-driven decision making and automated systems achieve unprecedented performance and are increasingly deployed in critical domains such as healthcare, law, and finance, ensuring their trustworthiness and safety becomes paramount. A fundamental aspect of trustworthy AI systems is their ability to accurately assess and communicate their epistemic uncertainty – the uncertainty about a given input that arises from the limitations of their training data. While various methods exist for quantifying epistemic uncertainty in machine learning models, current approaches either incur prohibitive computational costs or fail to provide sufficiently accurate uncertainty estimates for practical use.
We propose a novel gradient-based framework for estimating epistemic uncertainty in machine
learning models. Our approach obtains a model’s prediction, generates counterfactual outputs as hypothetical labels, and measures the resulting gradient magnitudes to assess how easily the model’s parameters would be influenced by alternatives. The core hypothesis is that higher gradient magnitudes indicate greater epistemic uncertainty, as they reveal the model’s susceptibility to alternative possibilities. While this method is applicable to machine learning models in general, it is particularly well-suited to NLP tasks due to the natural variability in linguistic expression, which provides a rich space of meaningful counterfactuals.
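To make the scoring procedure concrete, the following is a minimal sketch for a standard classifier, assuming a PyTorch model and a single input tensor of batch size one. It treats every non-predicted class as a counterfactual label and uses the mean L2 norm of the induced parameter gradients as the uncertainty score; the function name, the choice of counterfactual labels, and the aggregation are illustrative assumptions rather than the proposal's final design, and for NLP tasks the counterfactual space would instead be built from alternative generations or paraphrases.

```python
import torch
import torch.nn.functional as F

def gradient_uncertainty(model, x, num_classes):
    """Illustrative sketch: score epistemic uncertainty via the parameter-gradient
    magnitude induced by counterfactual (non-predicted) labels."""
    # Step 1: obtain the model's own prediction (no gradients needed here).
    with torch.no_grad():
        pred = model(x).argmax(dim=-1).item()

    grad_norms = []
    for label in range(num_classes):
        if label == pred:
            continue  # every alternative class serves as a counterfactual label
        # Step 2: treat the counterfactual label as the target and backpropagate.
        model.zero_grad()
        loss = F.cross_entropy(model(x), torch.tensor([label], device=x.device))
        loss.backward()
        # Step 3: measure how strongly the parameters would be pulled toward it.
        sq_norm = sum((p.grad ** 2).sum() for p in model.parameters()
                      if p.grad is not None)
        grad_norms.append(sq_norm.sqrt().item())

    model.zero_grad()
    # Higher average gradient magnitude -> parameters more easily swayed by
    # alternatives -> higher estimated epistemic uncertainty.
    return sum(grad_norms) / len(grad_norms)
```

Averaging over counterfactual labels is only one possible aggregation; taking the maximum, or weighting labels by their predicted probability, would be equally plausible variants to study.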
We will evaluate our method through empirical studies comparing its correlation with established uncertainty quantification techniques on standard benchmarks. We will investigate how well our uncertainty measure predicts instances of model hallucination and analyze the linguistic patterns that govern effective counterfactual generation. These insights will inform mechanisms for communicating model uncertainty through natural language. Our collaboration with mental healthcare researchers will evaluate the approach on therapeutic dialogue analysis for psychiatric disorders, demonstrating direct real-world impact.
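As a small illustration of the planned comparison, the sketch below computes the rank correlation between per-example scores from the gradient-based method and from an established baseline (for instance, MC-dropout predictive entropy); the score arrays shown are hypothetical placeholders, and Spearman correlation is one of several reasonable agreement measures.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-example uncertainty scores on a benchmark test set:
# `grad_scores` from the gradient-based method, `baseline_scores` from an
# established technique such as MC-dropout predictive entropy.
grad_scores = np.array([0.12, 0.85, 0.40, 0.93, 0.07])
baseline_scores = np.array([0.10, 0.80, 0.55, 0.90, 0.05])

rho, p_value = spearmanr(grad_scores, baseline_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```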
This research aims to advance the field of trustworthy AI by providing a computationally efficient and accurate method for uncertainty quantification. The findings will have immediate practical impact on the safety and reliability of machine learning systems, for example in mental healthcare, where therapeutic dialogue analysis requires high confidence in identified patterns.