Unifying Model, Data, and Training Attribution to Study Model Behavior
Date:
As part of a machine learning conference titled “Advancing AI Through Global Collaboration”, co-organized by MCML and Imperial, I presented our ExPLAIND framework, including a sneak peek at our most recent results on multilingual LLMs.
Abstract: Post-hoc interpretability methods typically attribute a model’s behavior to its components, its data, or its training trajectory in isolation. This leads to explanations that lack a unified view and may miss key interactions. While combining existing methods or applying them at different training stages offers broader insights, such approaches usually lack theoretical support. In this talk, I will present ExPLAIND, a unified framework that integrates all of these perspectives. By jointly interpreting model components and data over the course of training, ExPLAIND enables us to study generalization; I will share some of our most recent findings, both in smaller settings such as grokking and in LLMs generalizing over multilingual data.
