Lately, Large Language Models (LLMs) have improved largely through scaling. This has primarily involved increasing the size of the LLMs and the data they are trained on, resulting in a highly resource-intensive process that can cost up to millions of dollars.
While LLMs have become ubiquitous, the resource-intensive pre-training process poses a threat to the inclusion of low-resource languages, where data is scarce. Often, this is accompanied by a lack of funding for compute resources.
In our paper, SabiYarn: Advancing Low-Resource Languages with Multi-task NLP Pre-training, which was accepted at the AfricaNLP workshop at ACL 2025, we propose a series of optimization techniques for the LLM pre-training process that made it possible to train a SOTA multilingual foundation model for Nigerian languages on a single 24 GB GPU.
One of these techniques is a mask-based loss computation strategy. This simple idea avoids computing the loss on input prompt tokens the model already knows. It allows the loss function to accurately reflect the model’s true performance on the tokens that matter and avoids wasting compute on backpropagating losses that don’t contribute to the model’s learning.
In this article, we’ll explore this technique, how it reflects the broader compute-aware pre-training design, and its impact on the model’s performance.
Prompt tokens are (too) expensive in low-resource settings
During pre-training, LLMs are trained for causal language modeling via a next-token prediction task. This is typically a slow process involving trillions of tokens, whose goal is to reduce the cross-entropy loss between the predicted token and the label through backpropagation. Along the way, the model acquires a range of skills, memorizes facts, and builds a world model.
For state-of-the-art models like Meta’s Llama 4 or OpenAI’s GPT-4, this computationally intensive process typically involves running thousands of GPUs for months, performing over 10²⁵ floating-point operations (FLOP).
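To get a feel for where a number like that comes from, here is a quick back-of-the-envelope estimate using the widely cited approximation that training compute is roughly 6 × parameters × training tokens; the parameter and token counts below are illustrative assumptions, not disclosed figures for any particular model:

```python
# Back-of-the-envelope training-compute estimate using the common
# C ≈ 6 * N * D rule of thumb (FLOP ≈ 6 x parameters x training tokens).
# Both counts below are illustrative assumptions, not disclosed figures.
params = 1e12    # ~1 trillion parameters
tokens = 15e12   # ~15 trillion training tokens

flops = 6 * params * tokens
print(f"{flops:.1e} FLOP")  # 9.0e+25 FLOP, i.e., the >10^25 scale mentioned above
```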
Let’s look at a concrete example. Given a sequence like “Translate English to Yoruba: I like rice. => Mo fẹ́ràn ìrẹsì,” the model is trained to predict every token, from the prompt all the way to the actual answer:
| Step | Prompt | Next token |
| --- | --- | --- |
| 1 | Translate English to Yoruba: | I |
| 2 | Translate English to Yoruba: I | like |
| 3 | Translate English to Yoruba: I like | rice. |
| 4 | Translate English to Yoruba: I like rice. | => |
| 5 | Translate English to Yoruba: I like rice. => | Mo |
| 6 | Translate English to Yoruba: I like rice. => Mo | fẹ́ràn |
| 7 | Translate English to Yoruba: I like rice. => Mo fẹ́ràn | ìrẹsì |
In this setup, all tokens are treated equally, regardless of whether they are part of the prompt or the answer. On the one hand, this is simple to set up. On the other hand, it means spending compute on learning to predict tokens that are already known and static.
While this is fine in settings with nearly unlimited compute, it becomes problematic in resource-constrained training. Every token prediction contributes to the total training FLOPs. If half the sequence is an instruction or prompt that never changes, that’s half your compute spent on learning what the model doesn’t need to.
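To make this concrete, here is a minimal PyTorch sketch of standard causal language modeling with the loss computed on every position; it is an illustration with toy values, not SabiYarn’s training code:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8                                       # illustrative values
input_ids = torch.randint(0, vocab_size, (1, seq_len))             # prompt + answer token ids
logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)   # stand-in for model output

# Position t predicts token t+1, so logits and labels are shifted by one.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = input_ids[:, 1:].reshape(-1)

# All seq_len - 1 shifted positions contribute to the loss,
# whether they belong to the prompt or to the answer.
loss = F.cross_entropy(shift_logits, shift_labels)
loss.backward()
```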
Making do without instruction tuning
Due to severe compute constraints, we couldn’t include a post-training stage, where models are typically aligned with user-facing goals using supervised examples and reinforcement learning from human feedback (RLHF). In such stages, models learn not just to predict the next token but to generate helpful and aligned responses.
For example, a pre-trained base model might respond to “How are you today” with “?”, completing the sequence with the most likely next token. In contrast, an instruction-tuned model would try to provide a response that aligns with the goal of being a useful assistant or chatbot, e.g., “I’m doing well.”
Since post-training wasn’t feasible for SabiYarn, we embedded task awareness directly into the pre-training phase. Our goal was to help the model generalize beyond basic next-token prediction and toward solving meaningful tasks like named-entity recognition, sentiment analysis, and translation solely through prompt-based conditioning.
In our paper, we propose a task-specific training scheme where the model is conditioned on the task it should perform using XML-like prompt tags. Taking inspiration from the T5 paper, we used the following template:
<task_tag> model_input : Model’s output
For example, an English-to-Pidgin translation task looks like this:
<translate> let me call my father : Make I go call my Papa
With this structured format, we were now able to calculate the cross-entropy loss on just the label tokens (“Make I go call my Papa”).
This is straightforward to implement in PyTorch by masking out the prompt tokens in the label tensor. We use -100 as the ignore index, which PyTorch’s cross_entropy loss function skips:
```python
labels = input_ids.clone()
labels[:, :prompt_len] = -100
```
Since PyTorch’s cross-entropy loss function ignores the -100 index by default, the prompt tokens are excluded when the loss for that sequence is calculated.
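Extending the full-sequence sketch from earlier, here is a minimal, self-contained illustration (again with toy values, not SabiYarn’s actual training loop) showing that only the answer positions contribute to the loss and, therefore, to the gradients:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, prompt_len = 100, 8, 5                        # illustrative values
input_ids = torch.randint(0, vocab_size, (1, seq_len))

# Mask the prompt positions with -100, the index cross_entropy ignores by default.
labels = input_ids.clone()
labels[:, :prompt_len] = -100

logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)   # stand-in for model output

# Shift so that position t predicts token t+1, then flatten for cross_entropy.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = labels[:, 1:].reshape(-1)

# The loss is averaged only over positions whose label is not -100 (the answer tokens).
loss = F.cross_entropy(shift_logits, shift_labels)
loss.backward()

# Gradients at fully masked positions are exactly zero: no learning signal from the prompt.
print(logits.grad[0, :prompt_len - 1].abs().sum())  # tensor(0.)
```

The printed gradient sum is zero, confirming that the prompt positions provide context in the forward pass but contribute no learning signal.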
Learning only what matters
An unexpected benefit of this approach is improved task focus. Since the model is not backpropagating on the input portion of the sequence, its learning signal comes exclusively from task-relevant tokens.
Consider a pre-training scenario where an LLM is presented with:
<translate> let me call my father : Make I go call my Papa
When the loss is computed on every token, the model learns to reproduce the prompt structure, memorizes the task tags, and generates the outputs. The learning signal is diluted across the entire sequence.
With loss masking, the model can still make input-output connections through the self-attention mechanism during the forward pass. However, backpropagation (learning) only occurs when predicting the output tokens.
We can compare this to how we as humans learn to translate into a new language: we receive the full input as context, but learning happens when we are corrected on our translation, not on the input sentence already provided to us.
Masking out the input forces the model to treat prompts as context rather than as a prediction target, allowing training to focus on input-output mappings and reducing the tendency to overfit on prompt formatting.
Investigating the impact of task focus on training performance
To confirm this finding, we ran an experiment where we trained the model on the non-trivial problem of descrambling sentences, once with the masked loss scheme and once with a non-masked loss as a comparison.
The task was to turn grammatically incoherent sentences into their coherent forms using the same words as the input. For example, “The equations expensive. show is optimization computationally that.” should be corrected to “The equations show that optimization is computationally expensive.” This task requires learning complex relationships between input and output sequences.
Here’s what the loss curves looked like:
We can see that the model converged faster on the task when the loss on the input prompt wasn’t computed. These efficiency gains compound over the full training run.
The cost of masking: what are we losing?
While masking the prompt tokens during loss computation helps conserve compute and sharpen focus, it’s not without tradeoffs. Excluding the prompts from the learning signal increases the risk that the model will fail to adapt to tasks where the prompt structure or phrasing changes at inference time.
That said, such tradeoffs must be weighed against the reality of resource constraints. In low-resource training scenarios, approaches that reduce compute while preserving core task performance are often preferable to fully supervised, resource-intensive alternatives.
The case for native LLMs for African languages
While the broader African LLM community has focused its efforts on adapting open-source pre-trained models to African languages, pre-training a foundational model from scratch offers the promise of building a model that doesn’t inherit the cultural biases of Euro-American corpora. It also provides invaluable research insights and data about tokenization, transfer learning, linguistic patterns, and training dynamics for African languages.
An often-neglected area is the tokenizer. Tokenizers determine how languages are broken into tokens that LLMs can recognize. Training from scratch allows us to train our own language-specific tokenizers, thereby integrating morphological and phonological structure, such as tonal diacritics in Yoruba, which also carry semantic meaning.
It also helps with efficiency, as we obtain a tokenizer that splits each language into tokens reflecting useful grammatical structures, such as affixes and punctuation, which the model can use to learn meaningful representations. In contrast, using an existing tokenizer that isn’t trained on the target languages leads to poor tokenization, with tokens that don’t accurately reflect grammatical structure, inflated sequence lengths, and ultimately degraded performance. This is especially true for small models, which are appealing due to their lower compute demands.
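As an illustration of what training a language-specific tokenizer can look like, here is a minimal sketch using the Hugging Face tokenizers library; the corpus path, vocabulary size, and special tokens are placeholder assumptions, not the SabiYarn tokenizer recipe:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Byte-pair-encoding tokenizer trained directly on target-language text.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))

# NFC gives diacritics a single canonical Unicode form, so visually identical
# Yoruba text (e.g., tonal marks) maps to the same tokens.
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                                  # placeholder vocabulary size
    special_tokens=["<unk>", "<pad>", "<translate>"],   # task tags can be registered as special tokens
)

# "yoruba_corpus.txt" is a placeholder path to monolingual training text.
tokenizer.train(files=["yoruba_corpus.txt"], trainer=trainer)

print(tokenizer.encode("Mo fẹ́ràn ìrẹsì").tokens)
```

Trained this way, the vocabulary is learned directly from the target languages, so frequent affixes and diacritic-bearing characters end up as compact, meaningful tokens rather than being shattered into long byte sequences.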
Looking ahead, our research group’s future work focuses on exploring modern LLM architectures and bringing reasoning, instruction following, and test-time compute strategies to resource-constrained pre-training. We are also exploring hardware-specific optimizations for training and inference and expanding our efforts to even more African languages.