Abstract: Interpretability research in NLP often follows a predictable pattern—pick an indicator of structure or knowledge, such as probe or challenge set accuracy, measure that indicator in a fully trained model, and assert that this structure or information is integral to how the model functions. However, we can achieve a much deeper understanding by considering how these indicators emerge from the training process. First, this talk will discuss research on the relationship between interpretable generalization behavior and the presence of multiple basins on the loss landscapes of finetuned text classifiers. Then, I will describe how manipulating interpretable behaviors during the training process can shed light on the role of syntactic signals in attention distributions and, more generally, on how simplicity bias both helps and hurts language model performance. These results form the basis of a manifesto for exploring developmental explanations when researching interpretability and generalization behavior.
Bio: Naomi Saphra is a postdoctoral researcher at NYU with Kyunghyun Cho. She is interested in NLP training dynamics: how models learn to encode linguistic patterns or other structure and how we can encode useful inductive biases into the training process. Previously, she earned a PhD from the University of Edinburgh on Training Dynamics of Neural Language Models, worked at Google and Facebook, and attended Johns Hopkins and Carnegie Mellon University. Outside of research, she plays roller derby under the name Gaussian Retribution, does standup comedy, and shepherds disabled programmers into the world of code dictation.
Anyone can participate in the event, but please use the form on DIKU’s website to register.