Three Approaches to Tokenizing Out-of-Training Languages


In natural language processing (NLP), tokenization is the fundamental process of breaking text down into smaller, manageable units called tokens. This process becomes much harder and more error-prone when dealing with out-of-training or low-resource languages. In this post I'll explore the three approaches to this challenge I've come across that seem most promising, drawing on a recent presentation (slide deck here) I gave at the SIL/Microsoft Hackathon and on learnings shared in the Partnership for Applied Biblical NLP meetings.

1. Adding New Tokens to Large Language Models (LLMs)

The Problem: LLMs like the GPT models are trained on vast datasets, predominantly in high-resource languages. When encountering a low-resource language, these models struggle due to a lack of appropriate tokens and training data.

The Solution: Expand the LLM's vocabulary by training a tokenizer on new low-resource language text and adding the resulting tokens to the model. The SIL AI team's experiments with the NLLB (No Language Left Behind) model exemplify this approach (led by Michael Martin, with details shared by Bethany Moore and Matthew Shannon in the presentation linked above). I can't actually track down any public links to their work here, but I suspect they will share more at some point. Parts of the solution were based on forum discussions like this one.

How It Works:

  • Train a tokenizer (e.g., a SentencePiece model) on new text from a low-resource language.
  • Integrate these new tokens into an existing multilingual model, making sure the model prioritizes the new tokens when segmenting target-language text.
  • Fine-tune the model on the new training data (in SIL's case, an existing New Testament translation in the target language).
  • Predict some new text, such as the Old Testament.

This method has apparently improved on the base NLLB results quite a bit, though I don't have any numbers to share. I'm not sure whether they've tried this approach with other models, but I suspect it would work with any multilingual model.
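
To make the first couple of steps concrete, here's a minimal sketch using the SentencePiece Python API and Hugging Face transformers. The corpus filename, vocabulary size, and the distilled 600M NLLB checkpoint are my own illustrative assumptions, not details of the SIL setup, and a real run would follow this with fine-tuning on the target-language text.

```python
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Train a SentencePiece model on text in the low-resource language
#    (e.g. an existing New Testament translation).
spm.SentencePieceTrainer.train(
    input="target_language_nt.txt",  # hypothetical corpus file
    model_prefix="target_lang_sp",
    vocab_size=4000,
    model_type="unigram",
)

# 2. Add the new pieces to an existing multilingual model's tokenizer.
sp = spm.SentencePieceProcessor(model_file="target_lang_sp.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

num_added = tokenizer.add_tokens(
    [p for p in new_pieces if p not in tokenizer.get_vocab()]
)

# 3. Resize the embedding matrix so the new tokens get (randomly initialized)
#    embeddings, then fine-tune so the model learns what they mean.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} new tokens to the vocabulary.")
```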

2. Cipher-Based Approach

The Problem: Standard tokenization methods may not capture the linguistic nuances of low-resource languages, leading to inaccurate or nonsensical translations; the model may even get stuck outputting some unknown token over and over.

The Solution: Use a cipher to mask inputs and outputs, encouraging the model to rely on linguistic patterns via in-context learning. Because the model is simply observing character-level patterns, it can't inappropriately associate the text in question with tokens mistakenly "recognized" from other languages in the training data. This prevents dominant-language probabilities or unknown-token nonsense from overshadowing the target language's unique characteristics.

How It Works:

  • Masking the text of a low-resource language using a cipher.
  • Training the model to recognize and decode these patterns.
  • Allowing the model to generate outputs based on learned patterns rather than defaulting to high-resource language structures.

This approach is particularly useful for languages that share syntactic or phonetic similarities with high-resource languages, reducing the risk of misinterpretation.
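
To illustrate just the masking step, here's a toy substitution cipher in Python. The symbol inventory, helper names, and example sentence are my own placeholders for illustration; the actual experiments may mask text quite differently.

```python
# Toy character-level substitution cipher for masking target-language text,
# so the model sees consistent symbol patterns via in-context learning instead
# of tokens it mistakenly "recognizes" from high-resource languages.
CIPHER_SYMBOLS = "αβγδεζηθικλμνξοπρστυφχψω"  # placeholder symbol inventory

def build_cipher(text: str) -> dict[str, str]:
    """Assign each distinct (non-space) character in the corpus a mask symbol."""
    alphabet = sorted(set(text) - {" ", "\n"})
    if len(alphabet) > len(CIPHER_SYMBOLS):
        raise ValueError("Symbol inventory too small for this corpus.")
    return dict(zip(alphabet, CIPHER_SYMBOLS))

def encode(text: str, cipher: dict[str, str]) -> str:
    return "".join(cipher.get(ch, ch) for ch in text)

def decode(text: str, cipher: dict[str, str]) -> str:
    reverse = {v: k for k, v in cipher.items()}
    return "".join(reverse.get(ch, ch) for ch in text)

corpus = "mi tomo e kasi"        # placeholder target-language sentence
cipher = build_cipher(corpus)
masked = encode(corpus, cipher)  # this masked form goes into the prompt
print(masked, "->", decode(masked, cipher))
```

In practice you'd want to pick mask symbols the base model treats fairly neutrally, so the ciphered text doesn't simply re-trigger the same mistaken associations.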

See the early experiments by Chris Priebe. I suspect there is a lot more that can be done on this approach, and the use of in-context learning seems like a huge win from my perspective (cf. this post).

3. Logits Warping

The Problem: LLMs may favor more common tokens from high-resource languages, leading to poor representation of low-resource languages.

The Solution: Logits warping involves adjusting the probabilities of certain tokens to bias the model towards a specific language or grammatical structure.

How It Works:

  • Adjust the model's logits (its raw output scores over the vocabulary), either on the fly or via a grammar definition, to favor tokens from the target low-resource language.
  • This biasing can force the model to generate text in the intended language, adhering to its unique grammatical rules.

This technique is particularly effective for languages with distinct grammatical structures or non-Latin (but in-training) scripts, ensuring that the model's output aligns more closely with the target language's syntax and orthography.

I've done a few experiments with this approach, and it seems incredibly promising. I will probably write some more about how it might be combined with a system network in the flavour of Systemic Functional Linguistics (SFL) in the future.

Here's a short video clip showing how you can specify a list of tokens (I just passed it [token1, token2, token3] as the possible outputs). The model then generates text that only includes those tokens.
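
For anyone curious what that looks like in code, here's a minimal sketch of the same idea as a custom Hugging Face LogitsProcessor that sends every token outside an allow-list to negative infinity. The GPT-2 checkpoint, prompt, and placeholder words are my own stand-ins for whatever was used in the clip.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class AllowedTokensLogitsProcessor(LogitsProcessor):
    """Send the logits of every token outside `allowed_ids` to -inf."""

    def __init__(self, allowed_ids):
        self.allowed_ids = sorted(allowed_ids)

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0.0  # leave allowed tokens untouched
        return scores + mask

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Build the allow-list from a few stand-in words, plus EOS so generation can stop.
allowed = {tokenizer.eos_token_id}
for word in [" token1", " token2", " token3"]:
    allowed.update(tokenizer.encode(word))

inputs = tokenizer("Repeat after me:", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=20,
    logits_processor=LogitsProcessorList([AllowedTokensLogitsProcessor(allowed)]),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```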

Here's a second short clip where I force the model to only output Japanese characters.
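
The same mechanism can be driven by script membership rather than an explicit word list. Here's a sketch (again with a placeholder model) of how one might build a "Japanese characters only" allow-list by decoding each vocabulary entry and keeping only those within the relevant Unicode ranges; the resulting set plugs into the same kind of logits processor as above.

```python
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

# Japanese punctuation, hiragana, katakana, CJK ideographs, plus whitespace.
JAPANESE_ONLY = re.compile(r"^[\s\u3000-\u303f\u3040-\u30ff\u4e00-\u9fff]+$")

allowed_ids = {
    token_id
    for token_id in range(len(tokenizer))
    if JAPANESE_ONLY.match(tokenizer.decode([token_id]))
}
print(f"{len(allowed_ids)} of {len(tokenizer)} vocabulary entries are Japanese-only.")
```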

Conclusion

Tokenizing low-resource languages presents unique challenges in the field of NLP. However, by employing innovative approaches like adding new tokens, cipher-based methods, and logits warping, we can significantly improve the representation and understanding of these languages in LLMs. This not only enhances the accuracy of language models but also promotes linguistic diversity and inclusivity in the digital space. As we continue to advance in NLP, it's crucial to keep exploring and refining these techniques to better serve the global community of diverse language speakers.