AI and Low-Resource Translation

translation ai

Any machine translation technique needs lots of data to succeed. With low-resource languages, you simply don't have lots of data. For this reason, we need to explore methods and techniques that are somewhat orthogonal to the traditional approach of training a model on a large corpus of data.

Many AI-based translation approaches rely on transfer learning: you train a model on a large corpus of data, then fine-tune it on a smaller corpus. The idea is that the model learns generalizable features from the large corpus and can then pick up the specific features of the smaller one. This works well, but it still requires a large corpus of data to train on in the first place.

Ultimately, we're going to need something quite different to achieve high-quality translation for low-resource languages.

Opening up the black box

One of the biggest challenges with AI-based translation is that it works like a black box. You feed data in and you get data out, but you don't really know what's going on inside: what the model is learning, how it's learning it, or how it combines what it has learned to produce the output. This should rightly make translators somewhat nervous. How can you trust the output of a model that you don't understand?

My vision for translation assistance with AI is thus to pry open the black box by using LLM-based predictions rather than simple sequence-to-sequence predictions. This is a fundamentally different approach: it relies on in-context learning rather than fine-tuning or transfer learning.

In-context learning

In short, in-context learning involves providing "training data" to the model on the fly, in the prompt. (Read more from Stanford AI Lab).

This is a fundamental difference that is really critical for translation assistance. It means we don't need to take a bunch of data (e.g., a complete New Testament translation) in order to fine-tune a model for generating more data (e.g., a complete Old Testament translation). Instead, we can take a small amount of data (e.g., a few verses of the New Testament) and use it to generate more data (e.g., the next few verses of the New Testament, or some similar verses being drafted in the Old Testament).
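As a rough illustration, here is a minimal sketch of what few-shot drafting via in-context learning could look like. The example verse pairs are placeholders, and llm_complete stands in for whichever LLM endpoint you happen to use; nothing here names a specific tool or API.

```python
# A minimal sketch of few-shot drafting via in-context learning.
# The example pairs are placeholders; llm_complete() is a hypothetical
# stand-in for whatever LLM endpoint is available.

def build_drafting_prompt(examples, source_text):
    """Assemble a few-shot prompt from already-approved (source, target) pairs."""
    lines = [
        "Translate from the source language into the target language,",
        "following the style of the examples.",
        "",
    ]
    for src, tgt in examples:
        lines.append(f"Source: {src}")
        lines.append(f"Target: {tgt}")
        lines.append("")
    lines.append(f"Source: {source_text}")
    lines.append("Target:")
    return "\n".join(lines)

# A few already-translated verses act as the "training data" in the prompt.
examples = [
    ("source text of verse 1", "approved translation of verse 1"),
    ("source text of verse 2", "approved translation of verse 2"),
]

prompt = build_drafting_prompt(examples, "source text of the next verse")
# draft = llm_complete(prompt)  # hypothetical call to an LLM endpoint
print(prompt)
```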

We get two big benefits from this approach:

  1. Bootstrapping: We can use a small amount of data to generate a large amount of data. This is a huge benefit for low-resource languages, since we can use the data we have (which is not enough for fine-tuning a model) to generate more data, and then use that new data (again, without re-fine-tuning our model) to generate even more data, and so on. This is a virtuous cycle that can help us bootstrap our way to a complete translation (see the sketch after this list).
  2. Instant feedback: We get instant updates, allowing the real translator (a human) to provide feedback to the model on the fly.
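The bootstrapping cycle in point 1 can be sketched as a simple loop: every chunk the human translator approves (or corrects) joins the pool of few-shot examples used for the next chunk, with no fine-tuning anywhere. The draft_chunk and human_review functions below are hypothetical stand-ins for the LLM call and the translator's review step.

```python
# A sketch of the bootstrapping loop: each approved chunk grows the
# few-shot example pool, and no model is ever fine-tuned.

def draft_chunk(source, examples):
    """Hypothetical stand-in for an in-context LLM drafting call."""
    return f"<draft of: {source}>"

def human_review(source, draft):
    """Hypothetical stand-in for the translator accepting or correcting a draft."""
    return draft

def bootstrap_translation(source_chunks, seed_examples):
    examples = list(seed_examples)                # tiny starting pool
    approved = []
    for source in source_chunks:
        draft = draft_chunk(source, examples)     # drafted using in-context examples
        final = human_review(source, draft)       # human feedback, applied immediately
        examples.append((source, final))          # reused for the very next chunk
        approved.append(final)
    return approved
```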

What do we need to make in-context learning work throughout the translation process?

To leverage in-context learning, we need to be able to draft new translations using few-shot examples (e.g., use a few already-translated verses to draft a new verse). We also need to be able to evaluate those translations using few-shot examples (e.g., use a few already-translated verses to evaluate a new verse). Optionally, we might want or need a back-translation in order to enable evaluation by a third party (such as a translation consultant).
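The evaluation side can use the same in-context pattern: show the model a few approved verse pairs, then ask it to assess a new draft against them. The rubric wording below is purely illustrative, and llm_complete() is again a hypothetical stand-in.

```python
# A sketch of few-shot evaluation: the approved verse pairs that guide drafting
# are reused to ground an assessment of a new draft. The rubric wording is
# illustrative only; llm_complete() is a hypothetical stand-in.

def build_evaluation_prompt(examples, source_text, draft):
    lines = ["Here are approved translations from this project:", ""]
    for src, tgt in examples:
        lines.append(f"Source: {src}")
        lines.append(f"Approved: {tgt}")
        lines.append("")
    lines.append("Assess the following draft for faithfulness to the source and")
    lines.append("consistency with the approved examples. Note any omissions,")
    lines.append("additions, or terminology mismatches.")
    lines.append("")
    lines.append(f"Source: {source_text}")
    lines.append(f"Draft: {draft}")
    lines.append("Assessment:")
    return "\n".join(lines)

prompt = build_evaluation_prompt(
    [("source text of verse 1", "approved translation of verse 1")],
    "source text of the drafted verse",
    "draft translation to be checked",
)
# feedback = llm_complete(prompt)  # hypothetical call to an LLM endpoint
print(prompt)
```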

Working backwards, then (a sketch of how these pieces might fit together follows the list):

  • To evaluate, I need
    • a metric
    • [optional] a back-translation
  • To back-translate, I need
    • a translation
    • a non-hallucinating technique
  • To translate, I need
    • a chunk type (a unit of translation)
      • could be a Bible verse
      • could be a larger semantic unit that looks like a phrase
      • I would not recommend using a single word, since a word is a structural unit, and structures are precisely what gets left behind in translation
    • a source text
    • few-shot examples
    • [optional] supporting data
      • glossary
      • notes
      • evaluation feedback on prior drafts
    • [optional] a valid_tokens constraint on generation
    • [optional] a concise, prosaic description of the most relevant source language data
    • [optional] a concise description of the language-typological differences between the source and target languages
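To make the ingredient list above concrete, here is one way the drafting inputs might be bundled into a single request and flattened into a prompt. Every field and function name is an illustrative assumption; the only real requirement is that each ingredient, when present, ends up in the model's context (or, for valid_tokens, in the decoding step).

```python
# A sketch of bundling the translation ingredients listed above.
# All field and function names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DraftRequest:
    source_text: str                                  # the chunk to translate
    examples: list[tuple[str, str]]                   # few-shot (source, target) pairs
    glossary: dict[str, str] = field(default_factory=dict)
    notes: list[str] = field(default_factory=list)
    prior_feedback: list[str] = field(default_factory=list)
    valid_tokens: Optional[set[str]] = None           # applied at decoding time, not in the prompt
    source_language_notes: Optional[str] = None       # concise description of relevant source data
    typology_notes: Optional[str] = None              # source/target typological differences

def assemble_prompt(req: DraftRequest) -> str:
    parts = []
    if req.source_language_notes:
        parts.append(f"Source language notes: {req.source_language_notes}")
    if req.typology_notes:
        parts.append(f"Typological differences: {req.typology_notes}")
    if req.glossary:
        parts.append("Glossary: " + "; ".join(f"{k} = {v}" for k, v in req.glossary.items()))
    for note in req.notes:
        parts.append(f"Note: {note}")
    for fb in req.prior_feedback:
        parts.append(f"Feedback on an earlier draft: {fb}")
    for src, tgt in req.examples:
        parts.append(f"Source: {src}\nTarget: {tgt}")
    parts.append(f"Source: {req.source_text}\nTarget:")
    return "\n\n".join(parts)
```

If the serving stack supports constrained decoding, the valid_tokens set would be passed to the sampler rather than written into the prompt.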

Conclusion

Translation assistance can leverage LLMs and in-context learning to open up the black box of AI-based translation. This approach can help us bootstrap our way to a complete translation in a way that traditional machine translation and typical AI-based approaches, such as sequence-based model fine-tuning, cannot. It also accepts instant feedback from the translator, which means that every correction or improvement the translator makes gets leveraged in the very next verse or chunk being drafted.

Note: you can also use an approach like this for a multi-agent simulation. See also the Social Approach to Low-Resource Language Translation.