VOILA! Agents of Chaos and the Mission of Mechanistic Interpretability
Dr. Natalie Shapira (Khoury College of Computer Sciences, Northeastern University), https://www.khoury.northeastern.edu/people/natalie-shapira/

Recent advances in AI have led to increasingly autonomous systems exhibiting what is often referred to as agentic behavior: capabilities that include goal-directed planning, adaptation of strategies, decision making, and interaction with complex environments. While such capabilities are promising, they also introduce potential risks, including misalignment and unintended emergent behaviors that are difficult to anticipate or control.
In this talk, I highlight how agentic models can exhibit failure modes that resemble “agents of chaos,” producing unpredictable, misaligned, or strategically opaque behavior.
I argue that such phenomena cannot be adequately addressed through behavioral evaluation alone, nor through existing training paradigms such as reinforcement learning from human feedback (RLHF). Instead, we require mechanistic accounts of how internal representations and computational circuits give rise to agentic behavior. I will survey recent progress in mechanistic interpretability, focusing on efforts to reverse-engineer learned circuits associated with capabilities such as theory of mind, with the aim of developing predictive and causal models of model behavior.
I conclude by asking a broader question: to what extent is mechanistic interpretability necessary to tame agentic systems, and is it sufficient?
1.5
Short Course
Attendance is free and open to everyone interested. Please register via the course link, and you will receive the Zoom meeting details one day before the seminar.
English (with subtitles)
Online
For AIDA students only: In addition to registering via the course link, please click on the “Enroll in this course” button located at the bottom of the page to ensure that the course appears on your AIDA Certificate of Course Attendance upon successful completion.