<h1>RLHF 101: A Technical Tutorial on Reinforcement Learning from Human Feedback</h1>
<div>
<p>Reinforcement Learning from Human Feedback (RLHF) is a popular technique used to align AI systems with human preferences by training them using feedback from people, rather than relying solely on predefined reward functions. Instead of coding every desirable behavior manually (which is often infeasible for complex tasks), RLHF allows models, especially large language models (LLMs), to learn from examples of what humans consider good or bad outputs. This approach is particularly important for tasks where success is subjective or hard to quantify, such as generating helpful and safe text responses. RLHF has become a cornerstone in building more aligned and controllable AI systems, making it essential for developing AI that behaves in ways humans intend.</p>
<p>This blog dives into the full training pipeline of the RLHF framework. We'll explore every stage, from data generation and reward model inference to the final training of an LLM. Our goal is to make everything fully reproducible by providing all the necessary code and the exact specifications of the environments used. By the end of this post, you should know the general pipeline to train any model with any instruction dataset using the RLHF algorithm of your choice!</p>
<h2>Preliminary: Setup &amp; Environment</h2>
<p>We'll use the following setup for this tutorial:</p>
<ul>
<li><strong>Dataset:</strong> <a rel="nofollow" target="_blank" href="https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned_train">UltraFeedback</a>, a well-curated dataset consisting of general chat prompts. (While UltraFeedback also contains LLM-generated responses to the prompts, we won't be using these.)</li>
<li><strong>Base Model:</strong> <a rel="nofollow" target="_blank" href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct">Llama-3-8B-it</a>, a state-of-the-art instruction-tuned LLM. This is the model we'll fine-tune. (The checkpoint is gated; see the access note below the list.)</li>
<li><strong>Reward Model:</strong> <a rel="nofollow" target="_blank" href="https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1">Armo</a>, a strong reward model optimized for evaluating generated outputs. We'll use Armo to assign scalar reward values to candidate responses, indicating how "good" or "aligned" a response is.</li>
<li><strong>Training Algorithm:</strong> <a rel="nofollow" target="_blank" href="https://arxiv.org/pdf/2404.16767">REBEL</a>, a state-of-the-art algorithm tailored for efficient RLHF optimization.</li>
</ul>
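<p>One practical note: the Llama-3 checkpoint above is gated on the Hugging Face Hub, so you may need to accept the license on the model page and authenticate before any of the scripts below can download it. A minimal way to do this in Python (you can equivalently set the <code>HF_TOKEN</code> environment variable in your shell):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># One-time setup: request access to meta-llama/Meta-Llama-3-8B-Instruct on the Hub,
# then log in so the tokenizer and weights can be downloaded.
from huggingface_hub import login

login()  # prompts for an access token; alternatively export HF_TOKEN in your shell</pre>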
<p>To get started, clone our repo, which contains all the resources required for this tutorial:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="shell">git clone https://github.com/ZhaolinGao/REBEL
cd REBEL</pre>
<p>We use two separate environments for different stages of the pipeline:</p>
<ul>
<li><code>vllm</code>: Handles data generation, leveraging the efficient <a rel="nofollow" target="_blank" href="https://github.com/vllm-project/vllm">vllm</a> library.</li>
<li><code>rebel</code>: Used for training the RLHF model.</li>
</ul>
<p>You can install both environments using the provided YAML files:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="shell">conda env create -f ./envs/rebel_env.yml
conda env create -f ./envs/vllm_env.yml</pre>
<h2>Part 1: Data Generation</h2>
<p>The first step in the RLHF pipeline is generating samples from the policy to receive feedback on. Concretely, in this part, we'll load the base model using <code>vllm</code> for fast inference, prepare the dataset, and generate multiple responses for each prompt in the dataset.
The complete code for this part is available <a rel="nofollow" target="_blank" href="https://github.com/ZhaolinGao/REBEL/blob/main/src/ultrafeedback_largebatch/generate.py">here</a>.</p>
<p>Activate the <code>vllm</code> environment:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="shell">conda activate vllm</pre>
<p>First, load the base model and tokenizer using <code>vllm</code>:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">from transformers import AutoTokenizer
from vllm import LLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=8,
)</pre>
<p>Here, <code>tensor_parallel_size</code> specifies the number of GPUs to use.</p>
<p>Next, load the UltraFeedback dataset:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">from datasets import load_dataset

dataset = load_dataset("allenai/ultrafeedback_binarized_cleaned_train", split="train")</pre>
<p>You can select a subset of the dataset using <code>dataset.select</code>. For example, to select the first 10,000 rows:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">dataset = dataset.select(range(10000))</pre>
<p>Alternatively, you can split the dataset into chunks using <code>dataset.shard</code> for implementations like <a rel="nofollow" target="_blank" href="https://arxiv.org/pdf/2405.00675">SPPO</a> where each iteration only trains on one of the chunks, as sketched below.</p>
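<p>For instance, an iterative recipe that splits the prompts into four equal chunks and trains on one chunk per iteration could do something like this (illustrative values):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Keep only the first of 4 equal-sized chunks for this iteration.
# On the next iteration you would pass index=1, and so on.
dataset = dataset.shard(num_shards=4, index=0)</pre>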
<p>Now, let's prepare the dataset for generation. The Llama model uses special tokens to distinguish prompts and responses. For example:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic">&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;

What is France's capital?&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;</pre>
<p>Therefore, for every prompt in the dataset, we need to convert it from plain text into this format before generating:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">def get_message(instruction):
    message = [
        {"role": "user", "content": instruction},
    ]
    return message

prompts = [tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=False, add_generation_prompt=True) for row in dataset]</pre>
<ul>
<li><code>get_message</code> transforms the plain-text prompt into a dictionary indicating that it comes from the user.</li>
<li><code>tokenizer.apply_chat_template</code> adds the necessary special tokens and appends the assistant header (&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;\n\n) at the end when <code>add_generation_prompt=True</code>.</li>
</ul>
<p>Finally, we can generate the responses using <code>vllm</code> with the prompts we just formatted. We'll generate 5 responses per prompt:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">import torch
import random
import numpy as np
from vllm import SamplingParams

def set_seed(seed=5775709):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for p in range(5):
    set_seed(p * 50)
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.9,
        max_tokens=2048,
        seed=p * 50,
    )
    response = llm.generate(prompts, sampling_params)
    output = list(map(lambda x: x.outputs[0].text, response))
    dataset = dataset.add_column(f"response_{p}", output)</pre>
<ul>
<li><code>temperature=0.8, top_p=0.9</code> are common settings to control diversity during generation.</li>
<li><code>set_seed</code> is used to ensure reproducibility and sets a different seed for each response.</li>
<li><code>llm.generate</code> generates the responses, and the results are added to the dataset with <code>dataset.add_column</code>.</li>
</ul>
<p>You can run the complete script with:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="shell">python ./src/ultrafeedback_largebatch/generate.py --world_size NUM_GPU --output_repo OUTPUT_REPO</pre>
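<p>Once generation finishes, it can be worth spot-checking a few rows before moving on, and persisting the augmented dataset so that Part 2 can load it. The provided script already handles saving via <code>--output_repo</code>; the snippet below is only an illustrative sketch (the repo id is a placeholder):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Inspect one prompt and its five sampled responses.
row = dataset[0]
print(row["prompt"])
for i in range(5):
    print(f"----- response_{i} -----")
    print(row[f"response_{i}"][:300])

# Persist the dataset with the new response columns (placeholder repo id).
# dataset.push_to_hub("your-username/ultrafeedback-llama3-generations")</pre>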
<h2>Part 2: Reward Model Inference</h2>
<p>The second step in the RLHF pipeline is querying the reward model to tell us how good a generated sample was. Concretely, in this part, we'll calculate reward scores for the responses generated in Part 1, which are later used for training. The complete code for this part is available <a rel="nofollow" target="_blank" href="https://github.com/ZhaolinGao/REBEL/blob/main/src/ultrafeedback_largebatch/rank.py">here</a>.</p>
<p>Activate the <code>rebel</code> environment:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="shell">conda activate rebel</pre>
<p>To begin, we'll initialize the Armo reward model pipeline. This reward model is a fine-tuned sequence classification model that assigns a scalar reward score to a given dialogue based on its quality. (A rough sketch of what this wrapper does internally is shown at the end of this part.)</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># ArmoRMPipeline is defined in the rank.py script linked above.
rm = ArmoRMPipeline("RLHFlow/ArmoRM-Llama3-8B-v0.1", trust_remote_code=True)</pre>
<p>Now, we can gather the reward scores:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">def get_message(instruction, response):
    return [{"role": "user", "content": instruction}, {"role": "assistant", "content": response}]

rewards = {}
for i in range(5):
    rewards[f"response_{i}_reward"] = []
    for row in dataset:
        reward = rm(get_message(row['prompt'], row[f'response_{i}']))
        rewards[f"response_{i}_reward"].append(reward)
for k, v in rewards.items():
    dataset = dataset.add_column(k, v)</pre>
<ul>
<li><code>get_message</code> formats the user prompt and assistant response into a list of dictionaries.</li>
<li><code>rm</code> computes a reward score for each response in the dataset.</li>
</ul>
<p>You can run the complete script with:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="shell">python ./src/ultrafeedback_largebatch/rank.py --input_repo INPUT_REPO</pre>
<ul>
<li><code>INPUT_REPO</code> is the saved repo from Part 1 that contains the generated responses.</li>
</ul>
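<p>If you're curious what the <code>ArmoRMPipeline</code> wrapper is doing under the hood, here is a rough sketch, loosely following the usage shown on the ArmoRM model card. The class name and details below are illustrative; the repo's <code>rank.py</code> contains the authoritative implementation.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class SimpleArmoPipeline:
    """Illustrative stand-in for the repo's ArmoRMPipeline."""
    def __init__(self, model_id, device="cuda", trust_remote_code=True):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_id,
            device_map=device,
            trust_remote_code=trust_remote_code,
            torch_dtype=torch.bfloat16,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
        self.device = device

    def __call__(self, messages):
        # messages: [{"role": "user", ...}, {"role": "assistant", ...}]
        input_ids = self.tokenizer.apply_chat_template(messages, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model(input_ids)
        # ArmoRM's custom head exposes a scalar preference score per dialogue.
        return output.score.float().item()</pre>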
<h2>Part 3: Filter and Tokenize</h2>
<p>While the previous two parts are all we need in theory to do RLHF, it's often advisable in practice to perform a filtering pass to ensure training runs smoothly. Concretely, in this part, we'll walk through the process of preparing a dataset for training by filtering out excessively long prompts and responses to prevent out-of-memory (OOM) issues, selecting the best and worst responses for training, and removing duplicate responses. The complete code for this part is available <a rel="nofollow" target="_blank" href="https://github.com/ZhaolinGao/REBEL/blob/main/src/ultrafeedback_largebatch/filter_tokenize.py">here</a>.</p>
<p>Let's first initialize two different tokenizers, one that pads from the right and one that pads from the left:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer_left = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct", padding_side="left")
tokenizer_left.add_special_tokens({"pad_token": "[PAD]"})</pre>
<p>These two tokenizers allow us to pad the prompt from the left and the response from the right so that they meet in the middle. By combining left-padded prompts with right-padded responses, we ensure that:</p>
<ul>
<li>Prompts and responses meet at a consistent position.</li>
<li>Relative position embeddings remain correct for model training.</li>
</ul>
<p>Here's an example format:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic">[PAD] ... [PAD] &lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;

PROMPT&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;

RESPONSE&lt;|eot_id|&gt;[PAD] ... [PAD]</pre>
<p>We want to make sure that the length of</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic">[PAD] ... [PAD] &lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;

PROMPT&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;</pre>
<p>is the same for all prompts, and the length of</p>
<pre class="EnlighterJSRAW" data-enlighter-language="generic">RESPONSE&lt;|eot_id|&gt;[PAD] ... [PAD]</pre>
<p>is the same for all responses.</p>
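<p>The snippets below reuse a <code>get_message</code> helper that, unlike the ones in Parts 1 and 2, can format either a prompt on its own or a response on its own. The repo's <code>filter_tokenize.py</code> defines its own version; a sketch of what such a helper presumably looks like:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Illustrative helper: build a chat message list from a prompt and/or a response.
def get_message(instruction=None, response=None):
    message = []
    if instruction is not None:
        message.append({"role": "user", "content": instruction})
    if response is not None:
        message.append({"role": "assistant", "content": response})
    return message</pre>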
<p>We filter out prompts longer than 1,024 tokens and responses exceeding 2,048 tokens to prevent OOM during training:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(row['prompt']), tokenize=True, add_generation_prompt=True, return_tensors="pt").shape[-1] &lt;= 1024)
for i in range(5):
    dataset = dataset.filter(lambda row: tokenizer.apply_chat_template(get_message(response=row[f'response_{i}']), tokenize=True, add_generation_prompt=False, return_tensors="pt")[:, 5:].shape[-1] &lt;= 2048)</pre>
<p>Note that we skip the first 5 tokens of responses when counting lengths so as to exclude the special tokens (e.g. &lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;\n\n) and only count the actual length of the response plus the EOS token (&lt;|eot_id|&gt;) at the end.</p>
<p>Now we can tokenize the prompt with left padding to a maximum length of 1,024 tokens:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">llama_prompt_tokens = []
for row in dataset:
    llama_prompt_token = tokenizer_left.apply_chat_template(
            get_message(row['prompt']),
            add_generation_prompt=True,
            tokenize=True,
            padding='max_length',
            max_length=1024,
    )
    assert len(llama_prompt_token) == 1024
    assert (llama_prompt_token[0] == 128000 or llama_prompt_token[0] == 128256) and llama_prompt_token[-1] == 271
    llama_prompt_tokens.append(llama_prompt_token)
dataset = dataset.add_column("llama_prompt_tokens", llama_prompt_tokens)</pre>
<p>The assertions are used to make sure that the length is always 1,024 and that the tokenized prompt either begins with the <code>[PAD]</code> token or the <code>&lt;|begin_of_text|&gt;</code> token and ends with the <code>\n\n</code> token.</p>
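<p>Where do the magic numbers in these assertions (and the ones for the responses below) come from? They are simply the token ids of the relevant special tokens under the Llama-3 tokenizer; 128256 is the <code>[PAD]</code> token we just added, one past the original vocabulary. You can verify them with a quick check along these lines:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Look up the ids used in the assertions (values shown are what we expect for this tokenizer).
print(tokenizer.convert_tokens_to_ids("&lt;|begin_of_text|&gt;"))  # 128000
print(tokenizer.convert_tokens_to_ids("&lt;|eot_id|&gt;"))         # 128009
print(tokenizer.convert_tokens_to_ids("[PAD]"))              # 128256, the added pad token
print(tokenizer("\n\n", add_special_tokens=False).input_ids)  # [271], the trailing "\n\n"</pre>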
<p>Then, we select the responses with the highest and lowest rewards for each prompt as the chosen and reject responses, and tokenize them with right padding:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">chosen, reject, llama_chosen_tokens, llama_reject_tokens, chosen_reward, reject_reward = [], [], [], [], [], []

for row in dataset:

    all_rewards = [row[f"response_{i}_reward"] for i in range(5)]
    chosen_idx, reject_idx = np.argmax(all_rewards), np.argmin(all_rewards)

    chosen.append(row[f"response_{chosen_idx}"])
    reject.append(row[f"response_{reject_idx}"])

    llama_chosen_token = tokenizer.apply_chat_template(
            get_message(response=row[f"response_{chosen_idx}"]),
            add_generation_prompt=False,
            tokenize=True,
            padding='max_length',
            max_length=2048+5,
    )[5:]
    llama_chosen_tokens.append(llama_chosen_token)
    chosen_reward.append(row[f"response_{chosen_idx}_reward"])
    assert len(llama_chosen_token) == 2048
    assert llama_chosen_token[-1] == 128009 or llama_chosen_token[-1] == 128256

    llama_reject_token = tokenizer.apply_chat_template(
            get_message(response=row[f"response_{reject_idx}"]),
            add_generation_prompt=False,
            tokenize=True,
            padding='max_length',
            max_length=2048+5,
    )[5:]
    llama_reject_tokens.append(llama_reject_token)
    reject_reward.append(row[f"response_{reject_idx}_reward"])
    assert len(llama_reject_token) == 2048
    assert llama_reject_token[-1] == 128009 or llama_reject_token[-1] == 128256

dataset = dataset.add_column("chosen", chosen)
dataset = dataset.add_column("chosen_reward", chosen_reward)
dataset = dataset.add_column("llama_chosen_tokens", llama_chosen_tokens)
dataset = dataset.add_column("reject", reject)
dataset = dataset.add_column("reject_reward", reject_reward)
dataset = dataset.add_column("llama_reject_tokens", llama_reject_tokens)</pre>
<p>Again, the assertions are used to make sure that the lengths of the tokenized responses are always 2,048 and that the tokenized responses either end with the <code>[PAD]</code> token or the <code>&lt;|eot_id|&gt;</code> token.</p>
<p>Finally, we filter out rows where the chosen and reject responses are identical:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">dataset = dataset.filter(lambda row: row['chosen'] != row['reject'])</pre>
<p>and split the dataset into a training set and a test set of 1,000 prompts:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">dataset = dataset.train_test_split(test_size=1000, shuffle=True)</pre>
<p>You can run the complete script with:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="shell">python ./src/ultrafeedback_largebatch/filter_tokenize.py --input_repo INPUT_REPO</pre>
<ul>
<li><code>INPUT_REPO</code> is the saved repo from Part 2 that contains the rewards for each response.</li>
</ul>
<h2>Part 4: Training with REBEL</h2>
<p>Finally, we're ready to update the parameters of our model using an RLHF algorithm! We'll now use our curated dataset and the REBEL algorithm to fine-tune our base model.</p>
<p>At each iteration \(t\) of REBEL, we aim to solve the following squared-loss regression problem:<br />
$$\theta_{t+1}=\arg\min_{\theta\in\Theta}\sum_{(x, y, y')\in \mathcal{D}_t}\left(\frac{1}{\eta} \left(\ln \frac{\pi_{\theta}(y|x)}{\pi_{\theta_t}(y|x)} - \ln \frac{\pi_{\theta}(y'|x)}{\pi_{\theta_t}(y'|x)}\right) - \left(r(x, y) - r(x, y')\right)\right)^2$$</p>
<p>where \(\eta\) is a hyperparameter, \(\theta\) are the parameters of the model, \(x\) is a prompt, \(\mathcal{D}_t\) is the dataset we collected in the previous three parts, \(y\) and \(y'\) are responses to \(x\), \(\pi_\theta(y|x)\) is the probability of generating response \(y\) given prompt \(x\) under the parameterized policy \(\pi_\theta\), and \(r(x, y)\) is the reward of response \(y\) for prompt \(x\), obtained in Part 2. The detailed derivations of the algorithm are shown in our <a rel="nofollow" target="_blank" href="https://arxiv.org/pdf/2404.16767">paper</a>. In short, REBEL lets us avoid the complexity (e.g. clipping, critic models, ...) of other RLHF algorithms like PPO while enjoying stronger theoretical guarantees!</p>
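<p>To make the objective concrete, here is the per-pair regression loss written out with made-up numbers (purely illustrative; the actual training code operates on batches, as shown in Part 4 below):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">import torch

eta = 1.0
# Log-probabilities of the chosen response y under pi_theta and pi_{theta_t} (made-up values).
pi_logprob_y, pi_t_logprob_y = torch.tensor(-12.3), torch.tensor(-13.1)
# Log-probabilities of the rejected response y' under the same two policies.
pi_logprob_yp, pi_t_logprob_yp = torch.tensor(-15.8), torch.tensor(-14.9)
# Armo rewards for y and y'.
r_y, r_yp = torch.tensor(0.82), torch.tensor(0.41)

# (1/eta) * (log-ratio of y minus log-ratio of y') is regressed onto the reward difference.
implied_margin = ((pi_logprob_y - pi_t_logprob_y) - (pi_logprob_yp - pi_t_logprob_yp)) / eta
loss = (implied_margin - (r_y - r_yp)) ** 2</pre>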
<p>In this tutorial, we demonstrate a single iteration of REBEL (\(t=0\)) using the base model \(\pi_{\theta_0}\). For multi-iteration training, you can repeat Parts 1 through 4, initializing each iteration with the model trained in the previous iteration.</p>
<p>The complete code for this part is available <a rel="nofollow" target="_blank" href="https://github.com/ZhaolinGao/REBEL/blob/main/src/ultrafeedback_largebatch/rebel.py">here</a>. To enable full-parameter training on 8 GPUs, we use the <a rel="nofollow" target="_blank" href="https://huggingface.co/docs/accelerate/en/index">Accelerate</a> library with <a rel="nofollow" target="_blank" href="https://github.com/microsoft/DeepSpeed">DeepSpeed</a> Stage 3 by running:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="shell">accelerate launch --config_file accelerate_cfgs/deepspeed_config_stage_3.yaml --main-process-port 29080 --num_processes 8 src/ultrafeedback_largebatch/rebel.py --task.input_repo INPUT_REPO --output_dir OUTPUT_DIR</pre>
<ul>
<li><code>INPUT_REPO</code> is the saved repo from Part 3 that contains the tokenized prompts and responses.</li>
<li><code>OUTPUT_DIR</code> is the directory in which to save the models.</li>
</ul>
<h4>Step 1: Initialization &amp; Loading</h4>
<p>We start by initializing the batch sizes for distributed training:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">args.world_size = accelerator.num_processes
args.batch_size = args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps
args.local_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps
args.rebel.num_updates = args.total_episodes // args.batch_size</pre>
<ul>
<li><code>args.world_size</code> is the number of GPUs we're using.</li>
<li><code>args.local_batch_size</code> is the batch size on each GPU.</li>
<li><code>args.batch_size</code> is the effective batch size for training.</li>
<li><code>args.rebel.num_updates</code> is the total number of updates to perform, and <code>args.total_episodes</code> is the number of data points to train on. Typically, we set <code>args.total_episodes</code> to the size of the training set for one epoch. A worked example of this bookkeeping follows below.</li>
</ul>
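<p>As a quick sanity check on these quantities, here is the arithmetic with hypothetical values (the real defaults live in the script's argument parser):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Hypothetical configuration, purely to illustrate the bookkeeping above.
world_size = 8                        # number of GPUs
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
total_episodes = 48000                # e.g. roughly one epoch over the filtered training set

batch_size = world_size * per_device_train_batch_size * gradient_accumulation_steps  # 128
local_batch_size = per_device_train_batch_size * gradient_accumulation_steps         # 16
num_updates = total_episodes // batch_size                                            # 375
print(batch_size, local_batch_size, num_updates)</pre>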
<p>Next, we load the model and tokenizer, ensuring dropout layers are disabled so that the log-probabilities of the generations are computed without randomness:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">tokenizer = AutoTokenizer.from_pretrained(
                args.base_model,
                padding_side="right",
                trust_remote_code=True,
            )
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
policy = AutoModelForCausalLM.from_pretrained(
            args.base_model,
            trust_remote_code=True,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
        )
disable_dropout_in_model(policy)</pre>
<h4>Step 2: Training</h4>
<p>Looking again at the REBEL objective, the only quantities we still need in order to train are \(\pi_\theta(y|x)\) and \(\pi_{\theta_0}(y|x)\). We can compute each of them with:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># F is torch.nn.functional; input_ids, attention_mask, and seq_mask come from the training batch.
output = policy(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_dict=True,
    output_hidden_states=True,
)
logits = output.logits[:, args.task.maxlen_prompt - 1 : -1]
logits /= args.task.temperature + 1e-7
all_logprobs = F.log_softmax(logits, dim=-1)
logprobs = torch.gather(all_logprobs, 2, input_ids[:, args.task.maxlen_prompt:].unsqueeze(-1)).squeeze(-1)
logprobs = (logprobs * seq_mask).sum(-1)</pre>
<ul>
<li><code>output.logits</code> contains the logits over the vocabulary at every position of <code>input_ids</code>.</li>
<li><code>output.logits[:, args.task.maxlen_prompt - 1 : -1]</code> selects the logits for the response tokens only. The slice is shifted by 1 since the logits at position \(p\) are the predictions for the token at position \(p+1\).</li>
<li>We divide <code>logits</code> by <code>args.task.temperature</code> to recover the actual token probabilities used during generation.</li>
<li><code>torch.gather</code> picks out, at every position, the log-probability of the token that actually appears in the response.</li>
<li><code>seq_mask</code> masks out the padding tokens, so the final sum is the log-probability of the whole response. A sketch of the corresponding computation for \(\pi_{\theta_0}\) follows below.</li>
</ul>
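<p>The snippet above gives \(\pi_\theta(y|x)\) when run through the policy being trained. For the \(\pi_{\theta_0}(y|x)\) terms, one option is to run the identical computation through a frozen copy of the base model under <code>torch.no_grad()</code>. Note that the repo's training script may instead precompute or cache these log-probabilities; <code>ref_policy</code> below is a hypothetical frozen copy used only for illustration:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Same computation as above, but through a frozen reference copy of the base model.
with torch.no_grad():
    ref_output = ref_policy(
        input_ids=input_ids,
        attention_mask=attention_mask,
        return_dict=True,
    )
    ref_logits = ref_output.logits[:, args.task.maxlen_prompt - 1 : -1]
    ref_logits /= args.task.temperature + 1e-7
    ref_all_logprobs = F.log_softmax(ref_logits, dim=-1)
    ref_logprobs = torch.gather(ref_all_logprobs, 2, input_ids[:, args.task.maxlen_prompt:].unsqueeze(-1)).squeeze(-1)
    ref_logprobs = (ref_logprobs * seq_mask).sum(-1)

# Running the policy and the reference over both the chosen and the reject sequences yields the
# four quantities (pi_logprobs_y, pi_0_logprobs_y, pi_logprobs_y_prime, pi_0_logprobs_y_prime)
# used in the loss below.</pre>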
<h4>Step 3: Loss Computation</h4>
<p>Finally, we can compute the loss:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">reg_diff = ((pi_logprobs_y - pi_0_logprobs_y) - (pi_logprobs_y_prime - pi_0_logprobs_y_prime)) / eta - (chosen_reward - reject_reward)
loss = (reg_diff ** 2).mean()</pre>
<h2>Performance</h2>
<p>With just one iteration of the above four parts, we can greatly improve the performance of the base model on AlpacaEval, MT-Bench, and ArenaHard, three benchmarks commonly used to evaluate the quality, alignment, and helpfulness of responses generated by LLMs.</p>
<h2>Takeaway</h2>
<p>In this post, we outlined the pipeline for implementing RLHF, covering the entire process from data generation to the actual training phase. While we focused specifically on the REBEL algorithm, this pipeline is versatile and can readily be adapted to other methods such as <a rel="nofollow" target="_blank" href="https://arxiv.org/pdf/2305.18290">DPO</a> or <a rel="nofollow" target="_blank" href="https://arxiv.org/pdf/2405.14734">SimPO</a>. The required components for these methods are already included, aside from the specific loss formulation. There is also a natural extension of the above pipeline to <em>multi-turn</em> RLHF, where we optimize for performance over an entire conversation (rather than a single generation); check out our follow-up paper <a rel="nofollow" target="_blank" href="https://arxiv.org/pdf/2410.04612">here</a> for more information!</p>
<p>If you find this implementation useful, please consider citing our work:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="raw">@misc{gao2024rebel,
      title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
      author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
      year={2024},
      eprint={2404.16767},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}</pre>
</div>