#weak-supervision #RLHF #machine-learning #natural-language-processing #human-feedback #efficient-training #cost-reduction #ai #llm

Created at 220323

# [Anonymous feedback](https://www.admonymous.co/louis030195)

# [[Epistemic status]] #shower-thought

Last modified date: 220323

Commit: 0

# Related

# TODO

> [!TODO] TODO

# Weak supervision for RLHF

Weak supervision is a machine learning paradigm in which, instead of manually labeling all the training data, only partial or noisy labels are used to train the model. It is adopted when expert-labeled data is costly or unavailable. Weak supervision collectively models interdependent, noisy information sources to arrive at approximations of the true labels, and is therefore most often used in large-scale applications such as natural language processing and image recognition. The trade-off is that weakly supervised models can be less accurate than fully supervised ones, since the supervision is not comprehensive.

Weak supervision can be applied to [[RLHF|Reinforcement Learning from Human Feedback]][^1], where the human feedback is often noisy or imprecise. Instead of manually labeling every interaction between the agent and the human, weak supervision can approximate the true feedback. For example, if the human feedback takes the form of binary signals (e.g. good or bad actions), the model can be trained on weakly labeled data where the true feedback is not known for every interaction. This makes training more efficient and reduces the overall cost of collecting labeled data.

Weak supervision can also be used to identify the most informative interactions to label. By prioritizing the interactions most likely to improve the model's performance, the amount of expert-labeled data needed can be reduced. This is especially useful in RLHF, where collecting human feedback is time-consuming and costly.
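A minimal sketch of the binary-feedback idea, on simulated data: several noisy feedback sources each guess whether a response was good, and a simple majority vote stands in for the unknown true label. The number of sources, their accuracies, and the dataset size are all assumptions for illustration, not anything from a real RLHF pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 (prompt, response) interactions with an unknown
# true quality label, and 3 noisy feedback sources of assumed accuracy.
true = rng.integers(0, 2, 200)
accs = [0.8, 0.7, 0.65]  # assumed per-source accuracies (illustrative)
votes = np.stack(
    [np.where(rng.random(200) < a, true, 1 - true) for a in accs]
)

# Weak supervision step: approximate the true label by majority vote
# across the noisy sources, instead of labeling each interaction by hand.
weak_labels = (votes.mean(axis=0) > 0.5).astype(int)

agreement = (weak_labels == true).mean()
print(f"weak-label agreement with ground truth: {agreement:.2f}")
```

The aggregated weak labels could then train a reward model in place of fully expert-labeled data; more sophisticated aggregators also estimate each source's accuracy rather than assuming it.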
Overall, using weak supervision in RLHF can improve the efficiency of training and reduce the amount of expert-labeled data required, while still producing accurate results. However, it is important to carefully balance the use of weak supervision against the need for accurate feedback, so that the model still learns effectively from human input.

[^1]: https://arxiv.org/pdf/2212.10560.pdf
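The "most informative interactions" idea above can be sketched as uncertainty sampling: given a reward model's predicted probability that each interaction was good, send the interactions the model is least sure about (scores nearest 0.5) to human annotators. The scores here are random stand-ins for real model outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical reward-model scores in [0, 1] for 10 unlabeled interactions;
# a score near 0.5 means the model is least certain about that interaction.
scores = rng.random(10)

# Uncertainty sampling: request human labels for the k interactions
# whose predicted probability is closest to 0.5.
k = 3
uncertainty = -np.abs(scores - 0.5)
to_label = np.argsort(uncertainty)[-k:]
print("interactions to send for human labeling:", sorted(to_label.tolist()))
```

Only these k interactions would need expensive expert feedback; the rest keep their weak labels, which is where the cost reduction comes from.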