Linguistic Anomaly Detection for Translation Quality Assessment

translation

Linguistic Anomaly Detection (LAD) is a technique for similarity scoring (compression-based, embedding-based, or traditional NLP-based) between source-source and target-target pairs: the similarity of two source texts is compared with the similarity of their corresponding target texts. Rather than being scored against an external baseline, however, it has the essential feature of being relativized to the current state of the translation project, such that only the verses a human has already translated or edited serve as the gold standard. Here's an idea for how LAD might work in a translation project:

  1. Suppose you have a source1-target1 pair that has been manually vetted and is considered the gold standard.
  2. You also have a source2-target2 pair that has not been vetted yet.
  3. Compare the similarity of source1 and source2, and likewise of target1 and target2. This could involve compression-based similarity, TF-IDF intersection scoring against your current project, or some other method (see the sketch after this list).
  4. If the similarity between source1 and source2 is 0.8, then the similarity between target1 and target2 should fall within a standard deviation of 0.8. (This is my suspicion, at least.)
  5. Provide scores or reports on estimated consistency: the above technique (or a variation thereof) could be used to provide scores or reports on the estimated consistency of the unvetted target text (target2). This allows for a quantitative measure of the translation's internal consistency, and by implication it should indicate something about its overall quality.
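
To make this concrete, here is a minimal sketch in Python of one variation: compression-based similarity via normalized compression distance (NCD) using zlib, with the step-4 check expressed as the gap between source-side and target-side similarity. The function names, the zlib choice, and the gap formulation are my own illustration, not a fixed part of the technique.

```python
import zlib

def ncd_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from normalized compression distance.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is compressed length; similarity is 1 - NCD.
    """
    ca = len(zlib.compress(a.encode("utf-8")))
    cb = len(zlib.compress(b.encode("utf-8")))
    cab = len(zlib.compress((a + " " + b).encode("utf-8")))
    ncd = (cab - min(ca, cb)) / max(ca, cb)
    return max(0.0, 1.0 - ncd)  # clamp: zlib overhead can push NCD above 1

def consistency_gap(source1: str, target1: str,
                    source2: str, target2: str) -> float:
    """Absolute gap between source-side and target-side similarity.

    A small gap means target2 tracks the vetted pair's behavior;
    a large gap flags target2 as a potential anomaly for review.
    """
    return abs(ncd_similarity(source1, source2)
               - ncd_similarity(target1, target2))
```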

There's a lot of room for variation here, but the basic idea is to use a partially complete translation to evaluate the consistency of the rest of the translation. This is very different from the traditional BLEU score, which compares the translation to a baseline (e.g., a human translation or a machine translation). Traditional approaches require both a baseline and a hypothesis (i.e., two separate versions) of the specific text you are translating; this approach requires only a single target rendering, because it compares the translation to itself. In my view, that is a much more useful metric for translation quality assessment in the first place, since it doesn't necessarily prioritize word-for-word equivalence, which can be deceptive as a quantitative metric.
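
Since the gold standard is simply the vetted portion of the project, the step-5 report falls out naturally: score each unvetted pair against every sufficiently similar vetted pair and aggregate the gaps. Below is a hedged sketch reusing the functions above; the min_source_sim threshold, the mean aggregation, and the French example pairs are illustrative assumptions, not part of the technique itself.

```python
def consistency_report(vetted: list[tuple[str, str]],
                       unvetted: list[tuple[str, str]],
                       min_source_sim: float = 0.2) -> list[float]:
    """Mean consistency gap for each unvetted pair (lower = more consistent).

    Vetted pairs whose sources barely resemble the unvetted source
    carry little signal, so they are filtered out by min_source_sim.
    """
    scores = []
    for src2, tgt2 in unvetted:
        gaps = [consistency_gap(src1, tgt1, src2, tgt2)
                for src1, tgt1 in vetted
                if ncd_similarity(src1, src2) >= min_source_sim]
        scores.append(sum(gaps) / len(gaps) if gaps else float("nan"))
    return scores

vetted = [("In the beginning God created the heavens and the earth.",
           "Au commencement, Dieu créa les cieux et la terre.")]
unvetted = [("The earth was without form and void.",
             "La terre était informe et vide.")]
print(consistency_report(vetted, unvetted))  # one score per unvetted pair
```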