Torzi Luca
abstract
In recent years, self-supervised learning has emerged as a paradigm for learning more expressive representations of the input itself. It is widely used in Large Language Models, such as BERT, to learn contextual representations of words. Since the Transformer encoder block can, by design, accept virtually any type of input, this architecture was subsequently applied to other modalities, such as images. As a further step, CLIP was introduced with the aim of building a multi-modal model that connects image and text representations in a shared embedding space, learned from a large collection of (image, text) pairs gathered from the internet.
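As a minimal illustration of this shared embedding space (not part of the project code), the sketch below uses the Hugging Face transformers library and the pretrained openai/clip-vit-base-patch32 checkpoint, both assumed here purely for illustration, to project a caption and an image into the same vector space and compare them with cosine similarity.

```python
# Illustrative sketch (assumed setup, not the project's actual code):
# embed an image and a caption with a pretrained CLIP model and compare
# them in the shared embedding space via cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image file
caption = "a photo of a dog playing in the park"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Both embeddings live in the same space, so cosine similarity is meaningful.
similarity = torch.nn.functional.cosine_similarity(text_emb, image_emb)
print(similarity.item())
```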
This training methodology, which allows the model to learn from any type of data in a self-supervised way, can surface and amplify biases that exist in society and are inherited through the training data, leading to discrimination if the model is used in critical applications. The goal of this project is to understand whether a multi-modal model such as CLIP, which connects textual and visual embeddings in a shared space, exhibits more bias than a BERT model trained on text data only.
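As a hedged sketch of how such a comparison might look (the specific bias metric used in the project is not stated here), one could compute a simple WEAT-style association score: the difference in mean cosine similarity between a target phrase (e.g. a profession) and two sets of gendered attribute phrases, once with CLIP's text encoder and once with BERT. The model checkpoints, prompts, and word lists below are illustrative assumptions only.

```python
# Illustrative sketch (assumed metric, checkpoints, and word lists):
# compare a simple association score between CLIP's text encoder and BERT.
import torch
from transformers import BertModel, BertTokenizer, CLIPModel, CLIPProcessor

def clip_text_emb(texts):
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model.get_text_features(**inputs)

def bert_emb(texts):
    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    inputs = tok(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs).last_hidden_state        # (batch, seq, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)      # mean-pool over real tokens
    return (out * mask).sum(1) / mask.sum(1)

def association(embed_fn, target, attrs_a, attrs_b):
    # Difference in mean cosine similarity between the target and each attribute set.
    t = embed_fn([target])
    a = embed_fn(attrs_a)
    b = embed_fn(attrs_b)
    cos = torch.nn.functional.cosine_similarity
    return cos(t, a).mean().item() - cos(t, b).mean().item()

# Hypothetical target and attribute phrases, for illustration only.
target = "a photo of a nurse"
male = ["a photo of a man", "a photo of a father"]
female = ["a photo of a woman", "a photo of a mother"]

print("CLIP association:", association(clip_text_emb, target, male, female))
print("BERT association:", association(bert_emb, target, male, female))
```

Under this assumed setup, a larger absolute association score for CLIP's text encoder than for BERT would point toward the multi-modal training amplifying the bias; the project's actual measurement protocol may differ.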
outcomes