Human preference prediction
Many of the use cases for optimizing LLM performance involve predicting human preference. We will dive into how LLMs can be fine-tuned to produce responses that humans prefer.
Supervised Fine Tuning - SFT
In supervised fine-tuning, we start from a base LLM and a dataset of human-preferred responses. The idea is to train the LLM with a standard next-token prediction loss so that it learns to reproduce the responses humans preferred, as in the sketch below.
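A minimal SFT sketch, assuming a Hugging Face causal LM and a small list of prompt/response pairs; the model name and the example data are illustrative placeholders, not a recommended setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each example is a prompt concatenated with the human-preferred response.
examples = [
    "Q: What is SFT?\nA: Supervised fine-tuning on responses that humans preferred.",
]

model.train()
for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token prediction: labels are the input ids themselves,
    # so the loss is cross-entropy on the preferred response tokens.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```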
RLHF
In RLHF, a reward model is first trained on human preference data, and the LLM is then optimized against it with reinforcement learning (typically PPO). This training loop is complex and can be unstable, so it is rarely worth the effort in everyday use cases; it is mostly used in larger-scale setups. The pairwise loss used to train the reward model is sketched below.
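A sketch of the pairwise (Bradley-Terry style) reward-model loss used in RLHF, assuming scalar scores for the chosen and rejected responses have already been computed by a reward model; the scores below are dummy values, not part of a full RLHF pipeline.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # The reward model is trained so the human-preferred ("chosen") response
    # scores higher than the rejected one: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scalar scores for a batch of two preference pairs.
score_chosen = torch.tensor([1.2, 0.3])
score_rejected = torch.tensor([0.4, 0.9])
loss = reward_model_loss(score_chosen, score_rejected)
print(loss.item())
```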
DPO
DPO is one of the most common methods for optimizing for human preference. Unlike RLHF, it does not require a separate reward model or a reinforcement learning loop, so it needs far fewer resources. Its clever trick is to optimize the policy directly on preference pairs with a simple classification-style loss, sketched below.
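A sketch of the DPO objective (Rafailov et al., 2023), assuming per-sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model have already been computed; the function name, beta value, and input numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,
    policy_logp_rejected: torch.Tensor,
    ref_logp_chosen: torch.Tensor,
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # DPO trains the policy directly on preference pairs: push the log-ratio
    # (policy vs. reference) of the chosen response above that of the
    # rejected response, scaled by beta.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy per-sequence log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    torch.tensor([-12.0, -15.0]),
    torch.tensor([-14.0, -13.5]),
    torch.tensor([-12.5, -15.2]),
    torch.tensor([-13.0, -14.0]),
)
print(loss.item())
```

Because the reference model is frozen, the only trainable component is the policy itself, which is why DPO is much cheaper to run than a full RLHF pipeline.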