abstract
This project investigates how preference-based alignment with Direct Preference Optimization (DPO) (Rafailov et al., 2023) affects both the performance and the ethical behavior of a large language model (LLM). Specifically, we fine-tune the instruction-tuned dolly-v2-3b model (Databricks, 2023) on Anthropic's Helpful-Harmless (HH-RLHF) dataset (Bai et al., 2022). The goal is to evaluate whether such alignment degrades the model's performance on standard NLP tasks in comparison with the model before alignment tuning. We assess the model on both capability benchmarks (ARC-Easy, ARC-Challenge, HellaSwag, BoolQ) and fairness/safety benchmarks (Anthropic HH, RealToxicityPrompts, BiasBench).
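For context, a minimal sketch of the objective used during alignment tuning, as defined by Rafailov et al. (2023): given preference pairs $(x, y_w, y_l)$ with a chosen response $y_w$ and a rejected response $y_l$, DPO trains the policy $\pi_\theta$ against a frozen reference policy $\pi_{\mathrm{ref}}$ (in this setup, presumably the instruction-tuned dolly-v2-3b before alignment tuning):
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
\]
where $\sigma$ is the logistic function and $\beta$ controls how far the aligned policy may deviate from the reference.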
outcomes