Nicolas Cridlig • Marco Sangiorgi • Andrea Fossa
Abstract
Aligning Language Models (LMs) with human moral values is a critical challenge
in the development of responsible AI systems. Reward Models play a central role in this
process by enabling post-training alignment through Reinforcement Learning from Human
Feedback (RLHF) [2]. In this project, we leverage the ETHICS dataset [1] to investigate
the extent to which modern Reward Models can represent and generalize ethical concepts.
Using open-source tools, we plan to create morally congruent Reward Models aligned with
shared human values, intended for integration into RLHF pipelines. Our goal is to develop
a prototype framework capable of evaluating and guiding LMs toward ethically preferable
outputs.
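As a first illustration of the kind of evaluation we have in mind, the sketch below probes an off-the-shelf reward model on ETHICS-style scenarios using open-source tooling. The Hugging Face dataset path ("hendrycks/ethics", "commonsense" configuration with "input"/"label" fields) and the reward-model checkpoint are assumptions chosen for illustration, not the project's final choices.

# Minimal sketch, assuming the ETHICS commonsense split and an open-source reward
# model are available on the Hugging Face Hub; substitute the checkpoints actually
# used in the project.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # hypothetical choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# ETHICS "commonsense": each example pairs a short scenario with a 0/1 morality label.
dataset = load_dataset("hendrycks/ethics", "commonsense", split="test")

def reward_score(text: str) -> float:
    """Return the scalar reward the model assigns to a piece of text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

# Inspect whether morally acceptable scenarios tend to receive higher rewards.
for example in dataset.select(range(5)):
    score = reward_score(example["input"])
    print(f"label={example['label']}  reward={score:+.3f}  text={example['input'][:60]}...")

Aggregating such scores by label would give a first, coarse estimate of how often a given reward model ranks acceptable scenarios above unacceptable ones, before any fine-tuning on ETHICS.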
Products