Enhancing Translation Accuracy with Token Matching Evaluation

In machine translation, accuracy is paramount. One way to improve the precision of translations is a method called "Token Matching Evaluation". The method centres on valid tokens: words or phrases that have appeared in sample translation pairs.

The Process

  1. Populate a list of valid tokens for back-translation: The first step is to create the list of words or phrases that the model is allowed to use for back-translations.

  2. Back-translate using valid tokens: The model is instructed to back-translate any given word using only tokens from the list of valid tokens. For example, with the {{select options=valid_tokens n=6}} syntax, the model selects a specific value from a chunk of 6 valid tokens.

    • If a word has never been translated by a human even once, any attempted rendering is effectively indistinguishable from a hallucination, so the valid tokens must correspond to reality. For instance, to translate the English word 'Jerusalem', you would find 5 sentences containing 'Jerusalem'; those sentences would have Abanyom renderings, like 'Yerusalem'. Only renderings contained in the sample sentences are valid.

    • Any tokens not in the set of valid tokens are simply retained in their translated form in [square brackets], indicating that they have not been back-translated because they are unknown. Resolving these unknown tokens could be facilitated by having an agent send messages to the translation team via WhatsApp, or by using a simple gamified app.

  3. Gather samples based on token coverage: It might be beneficial to gather samples based on whether they help cover the missing tokens. Start with an initial set of 5 semantically similar examples, then add a second set of 5 sentences that include the English glosses not represented in the first five examples.
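
Steps 1 and 2 can be illustrated with a minimal sketch. It assumes the sample pairs are available as (English, target) tuples and uses a deliberately naive tokenizer. The function names are hypothetical, and the target-side strings (tok1, tok2, and so on) are placeholders rather than real Abanyom; only 'Yerusalem' comes from the example above. The sketch covers the bookkeeping only (building the valid set and bracketing unknown tokens), not the model call itself.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (naive; an assumption of this sketch)."""
    return re.findall(r"\w+", text.lower())

def build_valid_tokens(sample_pairs):
    """Collect every target-language token attested in the sample pairs."""
    valid = set()
    for _english, target in sample_pairs:
        valid.update(tokenize(target))
    return valid

def bracket_unknown(translation, valid_tokens):
    """Keep attested tokens as they are; wrap unattested ones in [square brackets]."""
    marked = []
    for word in translation.split():
        key = "".join(tokenize(word))  # strip punctuation before the lookup
        marked.append(word if key in valid_tokens else f"[{word}]")
    return " ".join(marked)

# Placeholder data: the target side is NOT real Abanyom.
sample_pairs = [
    ("He went to Jerusalem", "tok1 tok2 Yerusalem"),
    ("They left Jerusalem", "tok3 Yerusalem"),
]
valid_tokens = build_valid_tokens(sample_pairs)
marked = bracket_unknown("tok1 Yerusalem tok9", valid_tokens)
print(marked)  # tok1 Yerusalem [tok9]
```

Here tok9 never appears in a sample pair, so it is kept in brackets instead of being back-translated.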

By following these steps, the output translation is more likely to align with the sample translations provided in the prompt, thereby enhancing the accuracy of the translation.
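
The coverage-based sampling in step 3 could be sketched as a greedy selection. This is one possible reading, not a prescribed algorithm: it assumes the first set of semantically similar examples has already been chosen by some other means, treats English glosses as lowercase word tokens, and uses hypothetical names (`pick_coverage_samples`, `pool`) and placeholder target strings.

```python
import re

def words(text):
    """Set of lowercase word tokens (naive; an assumption of this sketch)."""
    return set(re.findall(r"\w+", text.lower()))

def pick_coverage_samples(source, first_set, pool, limit=5):
    """Greedily pick up to `limit` pairs from `pool` that cover English
    glosses of `source` not already represented in `first_set`."""
    missing = words(source)
    for english, _target in first_set:
        missing -= words(english)
    candidates = [pair for pair in pool if pair not in first_set]
    chosen = []
    while missing and candidates and len(chosen) < limit:
        # Take the candidate that covers the most still-missing glosses.
        best = max(candidates, key=lambda pair: len(words(pair[0]) & missing))
        gained = words(best[0]) & missing
        if not gained:
            break  # no remaining candidate helps
        chosen.append(best)
        candidates.remove(best)
        missing -= gained
    return chosen, missing

# Placeholder data: the target side is NOT real Abanyom.
source = "the king went to Jerusalem"
first_set = [("the king slept", "t1 t2 t3")]
pool = [
    ("he went home", "t4 t5 t6"),
    ("to Jerusalem", "t7 Yerusalem"),
    ("a dog barked", "t8 t9 t10"),
]
chosen, still_missing = pick_coverage_samples(source, first_set, pool)
print([english for english, _ in chosen])  # ['to Jerusalem', 'he went home']
print(still_missing)                       # set()
```

Any glosses left in `still_missing` have no attested rendering in the pool, so they are the ones that would end up in square brackets.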