In the realm of natural language processing (NLP), tokenization is the fundamental process of breaking down text into smaller, manageable units called tokens. This process, however, becomes increasingly complex and error-prone when dealing with out-of-training or low-resource languages. I'll explore the three approaches to this challenge I've come across that seem most promising, drawing from a recent presentation (slide deck here) I gave at the SIL/Microsoft Hackathon and learnings shared in the Partnership for Applied Biblical NLP meetings.
Approach 1: Adding New Tokens

The Problem: LLMs like the GPT models are trained on vast datasets, predominantly in high-resource languages. When encountering a low-resource language, these models struggle due to a lack of appropriate tokens and training data.
The Solution: Expanding the LLM's vocabulary by training a tokenizer on new, low-resource language text and adding these tokens to the model. This approach has been exemplified by the SIL AI team's experiments with the NLLB (No Language Left Behind) model (led by Michael Martin, details shared by Bethany Moore and Matthew Shannon as part of the presentation linked above). I can't actually track down any links to their work here, but I suspect they will share more public information at some point. Some of the solution was based on forum discussions like this one.
How It Works:
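I don't have the team's exact recipe, but the usual workflow for this kind of vocabulary expansion is: train a tokenizer on the new language's text, add the pieces the base vocabulary is missing, resize the model's embeddings, and then continue training on the new data. Here's a minimal sketch using Hugging Face transformers; the public NLLB checkpoint, the corpus path, and the vocabulary size are placeholder assumptions on my part, not the SIL AI team's actual setup:

```python
# A minimal sketch of vocabulary expansion with Hugging Face transformers.
# The checkpoint, corpus path, and vocab size are placeholder assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"  # public NLLB checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def corpus_iterator(path="low_resource_corpus.txt"):
    """Yield raw lines of low-resource language text (placeholder corpus)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

# 1. Train a fresh tokenizer on the new language, reusing the base tokenizer's settings.
new_tokenizer = tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=8000)

# 2. Add only the pieces the base vocabulary doesn't already have.
missing = [tok for tok in new_tokenizer.get_vocab() if tok not in tokenizer.get_vocab()]
tokenizer.add_tokens(missing)

# 3. Give the model (randomly initialised) embedding rows for the new tokens,
#    then continue training / fine-tuning on the low-resource data.
model.resize_token_embeddings(len(tokenizer))
```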
This method has apparently improved the base NLLB results quite a bit, though I don't have any numbers to share. I'm not sure whether they've tried this approach with other models, but I suspect it would work with any multilingual model.
Approach 2: Cipher-Based Masking

The Problem: Standard tokenization methods may not capture the linguistic nuances of low-resource languages, leading to inaccurate or nonsensical translations, or to the model getting stuck outputting some unknown token over and over.
The Solution: Masking inputs and outputs with a cipher, which encourages the model to rely on linguistic patterns via in-context learning: because the model only observes character-level patterns, it can't inappropriately associate the text with tokens mistakenly "recognized" from other languages in its training data. This technique can prevent dominant-language probabilities or unknown-token nonsense from overshadowing the target language's unique characteristics.
How It Works:
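I'll defer to Chris's experiments (linked below) for the real details, but the core idea can be illustrated with a toy substitution cipher: encipher the in-context examples so the model sees only unfamiliar character-level patterns, then decipher its output afterwards. Everything in this sketch, including the placeholder alphabet and the example sentences, is my own illustrative assumption:

```python
# A toy sketch of cipher-based masking (not the actual implementation).
# Every character in the target-language examples is remapped to an arbitrary
# placeholder symbol, so the model can only reason over character-level patterns
# supplied in context rather than "recognizing" tokens from other languages.
import random
import string

def build_cipher(text: str, seed: int = 0) -> dict:
    """Map each distinct character to a unique placeholder symbol."""
    chars = sorted(set(text))
    pool = list(string.ascii_letters + string.digits)  # assumes few enough distinct chars
    random.Random(seed).shuffle(pool)
    return {c: pool[i] for i, c in enumerate(chars)}

def encode(text: str, cipher: dict) -> str:
    return "".join(cipher.get(c, c) for c in text)

def decode(text: str, cipher: dict) -> str:
    reverse = {v: k for k, v in cipher.items()}
    return "".join(reverse.get(c, c) for c in text)

# In-context examples go into the prompt enciphered; the model's enciphered
# completion is decoded back into the target language afterwards.
examples = ["target-language sentence one", "target-language sentence two"]  # placeholders
cipher = build_cipher(" ".join(examples))
prompt = "\n".join(encode(s, cipher) for s in examples)
```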
This approach is particularly useful for languages that share syntactic or phonetic similarities with high-resource languages, reducing the risk of misinterpretation.
See the early experiments by Chris Priebe. I suspect there is a lot more that can be done on this approach, and the use of in-context learning seems like a huge win from my perspective (cf. this post).
Approach 3: Logits Warping

The Problem: LLMs may favor more common tokens from high-resource languages, leading to poor representation of low-resource languages.
The Solution: Logits warping involves adjusting the probabilities of certain tokens at generation time to bias the model towards a specific language or grammatical structure.
How It Works:
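At generation time you intercept the logits (the model's raw scores over its vocabulary) and push them towards or away from particular tokens before anything is sampled. The most direct version is a hard mask that forbids everything outside an allowed set. Here's a minimal sketch using Hugging Face's LogitsProcessor interface; the "gpt2" checkpoint, the prompt, and the allowed word list are placeholders rather than my actual experimental setup:

```python
# A minimal sketch of logits warping: a custom LogitsProcessor that masks
# every token outside an allowed set. Checkpoint, prompt, and allowed words
# are placeholders for illustration.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class AllowedTokensLogitsProcessor(LogitsProcessor):
    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(set(allowed_token_ids)), dtype=torch.long)

    def __call__(self, input_ids, scores):
        # Anything outside the allowed set gets a logit of -inf, so it can never be sampled.
        idx = self.allowed.to(scores.device)
        warped = torch.full_like(scores, float("-inf"))
        warped[:, idx] = scores[:, idx]
        return warped

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The allowed set here is just the token ids of a short phrase, standing in for
# the [token1, token2, token3] list mentioned below.
allowed_ids = tokenizer.encode("the cat sat on a mat")

inputs = tokenizer("Once upon a time", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    logits_processor=LogitsProcessorList([AllowedTokensLogitsProcessor(allowed_ids)]),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```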
This technique is particularly effective for languages with distinct grammatical structures or non-Latin but in-training scripts, ensuring that the model's output aligns more closely with the target language's syntax and orthography.
I've done a few experiments with this approach, and it seems incredibly promising. I will probably write some more about how it might be combined with a system network in the flavour of Systemic Functional Linguistics (SFL) in the future.
Here's a short video clip showing how you can specify a list of tokens (I just passed it [token1, token2, token3] as the possible outputs). The model then generates text that only includes those tokens.
Here's a second short clip where I force the model to only output Japanese characters.
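As an illustration of how that kind of script constraint could be wired up (not the exact code behind the clip), here's a continuation of the LogitsProcessor sketch above that allows only vocabulary entries which decode to Japanese characters. It assumes the tokenizer actually contains whole Japanese-script tokens, which is true of most multilingual vocabularies but only partially true of a mostly-English one like GPT-2's:

```python
# Continuing from the AllowedTokensLogitsProcessor sketch above: keep every
# vocabulary entry that decodes to purely Japanese characters and allow only those.
def is_japanese_char(ch: str) -> bool:
    code = ord(ch)
    return (
        0x3040 <= code <= 0x309F      # Hiragana
        or 0x30A0 <= code <= 0x30FF   # Katakana
        or 0x4E00 <= code <= 0x9FFF   # CJK unified ideographs
    )

# Decode every id once (slow but simple) and keep the all-Japanese ones.
japanese_ids = [
    i
    for i in range(len(tokenizer))
    if (piece := tokenizer.decode([i]).strip())
    and all(is_japanese_char(ch) for ch in piece)
]

output = model.generate(
    **inputs,
    max_new_tokens=40,
    logits_processor=LogitsProcessorList([AllowedTokensLogitsProcessor(japanese_ids)]),
)
```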
Tokenizing low-resource languages presents unique challenges in the field of NLP. However, by employing innovative approaches like adding new tokens, cipher-based methods, and logits warping, we can significantly improve the representation and understanding of these languages in LLMs. This not only enhances the accuracy of language models but also promotes linguistic diversity and inclusivity in the digital space. As we continue to advance in NLP, it's crucial to keep exploring and refining these techniques to better serve the global community of diverse language speakers.