Tancredi Bosi • Francesco Farneti • Giovanni Grotto
Abstract
The increasing deployment of deep learning models for content moderation on social
media platforms demands not only high performance but also transparency and account-
ability. In this project, we focus on enhancing the explainability of Transformer-based
models in the task of detecting sexist content on Twitter.
Using a labeled dataset of tweets annotated for sexism, we fine-tune a pre-trained Transformer
model to classify tweets as sexist or non-sexist. To address the black-box nature of such
models, we incorporate multiple interpretability techniques: Local Interpretable Model-agnostic
Explanations (LIME), SHapley Additive exPlanations (SHAP), and attention weight analysis.
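As a minimal sketch of how such explanations could be produced (not the exact pipeline used in this work; the checkpoint name, label names, and example tweet below are hypothetical placeholders), LIME and SHAP can both be wrapped around a fine-tuned Hugging Face classifier:

```python
import numpy as np
import shap
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

MODEL_NAME = "distilbert-base-uncased"  # placeholder for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None)

def predict_proba(texts):
    # Return an (n_samples, n_classes) probability array, as LIME expects.
    outputs = clf(list(texts))
    return np.array(
        [[s["score"] for s in sorted(out, key=lambda s: s["label"])] for out in outputs]
    )

tweet = "Example tweet to explain"  # hypothetical input

# LIME: perturb the input text and fit a local linear surrogate around it.
lime_explainer = LimeTextExplainer(class_names=["non-sexist", "sexist"])
lime_exp = lime_explainer.explain_instance(tweet, predict_proba, num_features=10)
print(lime_exp.as_list())

# SHAP: token-level Shapley-value attributions via the pipeline-aware explainer.
shap_explainer = shap.Explainer(clf)
print(shap_explainer([tweet]))
```

Attention weight analysis can be obtained in the same setting by running the model with output_attentions=True and inspecting the returned attention tensors.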
These methods offer complementary perspectives on model decision-making, enabling a
deeper understanding of which words or phrases contribute most significantly to classifica-
tion outcomes. Our analysis highlights the strengths and limitations of each explainability
approach in this context, offering insights into both the behavior of the model and the
linguistic patterns associated with online sexism.