Nicolas Cridlig • Marco Sangiorgi • Andrea Fossa
Abstract
Aligning Language Models (LMs) with human moral values is a critical challenge
in the development of responsible AI systems. Reward Models play a central role in this
process by enabling post-training alignment through Reinforcement Learning from Human
Feedback (RLHF) [2]. In this project, we leverage the ETHICS dataset [1] to investigate
the extent to which modern Reward Models can represent and generalize ethical concepts.
Using open-source tools, we plan to create morally congruent Reward Models aligned with
shared human values, intended for integration into RLHF pipelines. Our goal is to develop
a prototype framework capable of evaluating and guiding LMs toward ethically preferable
outputs.
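As a first illustration of the kind of evaluation we have in mind, the sketch below probes an off-the-shelf reward model on ETHICS-style scenarios using open-source tooling. The Hugging Face dataset path ("hendrycks/ethics", "commonsense" configuration with "input"/"label" fields) and the reward-model checkpoint are assumptions chosen for illustration, not the project's final choices.

# Minimal sketch, assuming the ETHICS commonsense split and an open-source reward
# model are available on the Hugging Face Hub; substitute the checkpoints actually
# used in the project.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # hypothetical choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# ETHICS "commonsense": each example pairs a short scenario with a 0/1 morality label.
dataset = load_dataset("hendrycks/ethics", "commonsense", split="test")

def reward_score(text: str) -> float:
    """Return the scalar reward the model assigns to a piece of text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

# Inspect whether morally acceptable scenarios tend to receive higher rewards.
for example in dataset.select(range(5)):
    score = reward_score(example["input"])
    print(f"label={example['label']}  reward={score:+.3f}  text={example['input'][:60]}...")

Aggregating such scores by label would give a first, coarse estimate of how often a given reward model ranks acceptable scenarios above unacceptable ones, before any fine-tuning on ETHICS.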
Products