Tancredi Bosi • Francesco Farneti • Giovanni Grotto
Abstract
The increasing deployment of deep learning models for content moderation on social
media platforms demands not only high performance but also transparency and account-
ability. In this project, we focus on enhancing the explainability of Transformer-based
models in the task of detecting sexist content on Twitter.
Using a labeled dataset of tweets annotated for sexism, we fine-tune a pre-trained Transformer
model to classify tweets as sexist or non-sexist. To address the black-box nature of such
models, we incorporate multiple interpretability techniques: Local Interpretable Model-agnostic
Explanations (LIME), SHapley Additive exPlanations (SHAP), and attention weight analysis.
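As a minimal sketch of how such explanations could be produced (not the exact pipeline used in this work; the checkpoint name, label names, and example tweet below are hypothetical placeholders), LIME and SHAP can both be wrapped around a fine-tuned Hugging Face classifier:

```python
import numpy as np
import shap
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

MODEL_NAME = "distilbert-base-uncased"  # placeholder for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None)

def predict_proba(texts):
    # Return an (n_samples, n_classes) probability array, as LIME expects.
    outputs = clf(list(texts))
    return np.array(
        [[s["score"] for s in sorted(out, key=lambda s: s["label"])] for out in outputs]
    )

tweet = "Example tweet to explain"  # hypothetical input

# LIME: perturb the input text and fit a local linear surrogate around it.
lime_explainer = LimeTextExplainer(class_names=["non-sexist", "sexist"])
lime_exp = lime_explainer.explain_instance(tweet, predict_proba, num_features=10)
print(lime_exp.as_list())

# SHAP: token-level Shapley-value attributions via the pipeline-aware explainer.
shap_explainer = shap.Explainer(clf)
print(shap_explainer([tweet]))
```

Attention weight analysis can be obtained in the same setting by running the model with output_attentions=True and inspecting the returned attention tensors.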
These methods offer complementary perspectives on model decision-making, enabling a
deeper understanding of which words or phrases contribute most significantly to classifica-
tion outcomes. Our analysis highlights the strengths and limitations of each explainability
approach in this context, offering insights into both the behavior of the model and the
linguistic patterns associated with online sexism.