This post introduces some of my co-authors’ and my work on making Neural Machine Translation (NMT) more reliable. Throughout the post, I will discuss reliability in terms of adherence to certain specifications [4], moving from looser specifications toward tighter ones as I characterize NMT failure modes. This way of characterizing problems may seem like a departure from standard practice in MT, where we are used to thinking about quality in terms of corpus-level metrics. But as I will argue, the idea of specifications (as in engineering) is a more powerful way to think about the quality of modern state-of-the-art (SOTA) MT systems, and it lets us unify several phenomena under a single framework. For example, the simplest specification for an MT system is that, given a valid input sentence, it should not output something entirely unrelated to it; hallucinations can then be characterized as samples on which the model breaks this basic specification. Later on, I will discuss how the nature of MT errors necessitates an evaluation protocol in which average-case performance measures are augmented with specification-based measurements.
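To make the idea of a specification concrete, here is a minimal sketch in which each specification is just a predicate over a (source, translation) pair. The two checks below (a sane length ratio and the absence of heavy n-gram repetition, a common symptom of degenerate outputs) are illustrative heuristics of my own choosing, not methods from the referenced papers:

```python
# A sketch of framing MT quality as specifications: each specification
# is a boolean predicate over a (source, translation) pair.
# These particular checks are illustrative heuristics, not the papers' methods.

def length_ratio_ok(source: str, translation: str,
                    low: float = 0.5, high: float = 2.0) -> bool:
    """Degenerate outputs often violate a sane source/target length ratio."""
    ratio = len(translation.split()) / max(len(source.split()), 1)
    return low <= ratio <= high

def no_oscillation(translation: str, max_repeats: int = 3) -> bool:
    """Oscillatory outputs repeat the same bigram many times over."""
    tokens = translation.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return True
    return max(bigrams.count(b) for b in set(bigrams)) <= max_repeats

def meets_basic_specs(source: str, translation: str) -> bool:
    """A translation is acceptable only if it satisfies every specification."""
    return length_ratio_ok(source, translation) and no_oscillation(translation)
```

Each predicate yields an instance-level pass/fail verdict rather than a corpus-level score, which is the shift in perspective the rest of the post builds on.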

Before diving deeper, I should note that three intertwined themes run through this discussion. The first is gaining a mechanistic understanding of NMT failure modes: the what and the how behind the observed errors. The second is constructing measurements beyond average-case performance for NMT models, and the third is building mechanisms to mitigate these failure modes. To limit scope creep, this post is not about standardizing MT evaluation or about expanding the scope of MT toward the long tail of languages. To qualify it further, the discussion is mainly concerned with high- to mid-resource language pairs, on the assumption that an adequate amount of data is available and reliability has become the chief research concern for such pairs. There will also be no discussion of in-distribution (ID) vs. out-of-distribution (OOD) generalization since, in practice, for NMT systems at scale (e.g., trained on over half a billion samples along with common augmentation strategies), such a distinction offers little comfort in the real world. In the next section, I will discuss the mechanisms behind hallucinations in NMT.

The Curious Case of Hallucinations: Indeterminacy, Noise, and Memorization

Owing to data availability, sentence-level MT remains the dominant paradigm for assembling MT corpora and building MT systems. The implicit assumption underlying sentence-level MT is bitext equivalence: that the source sentence alone is adequate to generate the translation. There are many cases where this assumption is violated, e.g., when translating from a gender-neutral language into a gendered one. However, I will not be focusing on this cause of hallucination in this post. The breakdown of this assumption can, at worst, lead to token-level hallucinations, and I think such errors can be relegated to the category of irreducible errors at the sentence level (and they are easy to model once additional context is available). With this cause out of the way, we can turn to more interesting mechanisms.

Once the inductive biases (architecture, regularizers, loss functions) of the model are set, the model itself is a stochastic function of the training data and the training algorithm. Unless you are using differentially private training, there is no guarantee that the model will not allocate capacity to memorizing noisy or long-tailed training samples. Such memorization helps the model achieve good test performance on rare subpopulations, and this observation is agnostic to the choice of data, since any data distribution exposes a frontier of long-tailed samples to the model. Hence, it is natural to expect that the problem of hallucinations in NMT can be traced back to two obvious culprits: the data and the training algorithm. Our work characterizes the problem of hallucinations in these terms and presents and tests two hypotheses [1]:

  1. The samples memorized by an NMT model are the most likely to generate hallucinations when perturbed.
  2. Corpus-level noise patterns dictate the type of hallucinations generated by the NMT model.
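The first hypothesis suggests a simple probing recipe: slightly perturb a source sentence and check whether the model's translation collapses into something nearly disjoint from its original output. Here is a rough sketch of that idea, where `translate` is a stand-in for any NMT system and Jaccard overlap is my own simplistic choice of similarity (the paper's actual protocol is more involved):

```python
# A sketch of perturbation-based probing for hallucination-prone samples:
# if a small source perturbation makes the translation nearly disjoint
# from the original, flag the sample. `translate` is any callable
# str -> str wrapping an NMT system (an assumption for illustration).

def unigram_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two translations."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def hallucinates_under_perturbation(translate, source: str,
                                    perturbed_source: str,
                                    threshold: float = 0.1) -> bool:
    """Flag the sample if a minor source edit collapses the translation."""
    base = translate(source)
    perturbed = translate(perturbed_source)
    return unigram_overlap(base, perturbed) < threshold
```

In practice one would sweep many perturbations per source (token drops, misspellings, inserted tokens) and aggregate the flags, since a single perturbation gives a noisy signal.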

In our work, we also propose a taxonomy of hallucinations and show examples of how the training algorithm interacts with the data distribution to create specific hallucination types. Since both memorization and corpus noise are here to stay, it is imperative to develop techniques to integrate their measurements into system development, and the next section in this blog post is about building fine-grained measurements to tackle such problems during system development and evaluation. There’s another type of memorization that acts a bit more insidiously and creates further reliability problems [3], but I will leave that for a future discussion.

Building Measurements of Long-Tailed Errors

Even when we have a mechanistic understanding of most NMT failure modes, we (as a community) lack measurements beyond average-case performance, and so we lack visibility into such problems during system development. While metrics such as BLEU, COMET, and BERTScore are widely used to provide this signal, the assumptions behind these metrics make them ill-suited for targeting reliability, for two reasons:

  1. Metrics implicitly model fluency and adequacy jointly and only provide a single-dimensional measurement.
  2. Metrics do not have any explicit notion of saliency and are blind to errors in the translation of salient content.
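To make the second point concrete, here is a toy illustration (a bare unigram-precision stand-in of my own, not any of the metrics named above): a salient error and a trivial error receive identical surface-overlap scores, because each differs from the reference by exactly one token.

```python
# A toy illustration of saliency-blindness in surface-overlap metrics:
# a dangerous error (wrong dosage) and a harmless one (wrong article)
# score identically, since each changes a single token.

def unigram_precision(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens that appear in the reference."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    matches = sum(1 for t in hyp_tokens if t in ref_tokens)
    return matches / len(hyp_tokens)

reference     = "take 50 mg of the medication daily"
salient_error = "take 15 mg of the medication daily"  # wrong dose
trivial_error = "take 50 mg of a medication daily"    # wrong article

p_salient = unigram_precision(salient_error, reference)
p_trivial = unigram_precision(trivial_error, reference)
# p_salient == p_trivial, yet only one of the two errors is dangerous
```

Learned metrics soften this picture somewhat, but none of them carries an explicit, auditable notion of which content is salient, which is the gap the rest of this section targets.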

An obvious remedy is behavioral testing, notably CheckList, an evaluation methodology that builds test cases for different capabilities of NLP models. However, this approach doesn’t generalize to NMT for four key reasons:

  1. There could be multiple valid translations of the same input.
  2. Errors in SOTA NMT systems are rare, and vast amounts of test data are required to elicit long-tail errors.
  3. Errors are highly contextual, and fixed (limited-diversity) test cases can’t capture this.
  4. Error distributions shift with every iteration over the data, tokenization, model, learning algorithm, and search procedure.

If we think from the ground up, we can ask: what should a framework for measuring long-tailed errors do? We believe such a framework should address three primary concerns:

  1. Provide targeted error measurements at the instance level (not at the corpus level).
  2. Be scalable to address error rarity and consume only monolingual data (no references).
  3. Be invariant to (unstable) model error distributions (can’t cache errors).

In our work on SALTED [2], we propose a methodology for obtaining such measurements through behavioral testing that leverages monolingual data and relies on specifications of correct behavior. We do so by constructing “detectors”: a detector is an algorithm that, given a source-translation sentence pair, returns a boolean indicating the presence of an error condition with very high precision. In the paper, we present a methodology for building detectors and show that the same framework can be used for data filtering as well as data synthesis; the data-synthesis application can further be used to fix token-level translation errors in the model through fine-tuning. Overall, we believe fine-grained measurement frameworks such as SALTED can contribute greatly to improving NMT reliability.
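As a flavor of what a detector looks like, here is a minimal sketch of one: a high-precision boolean check over a (source, translation) pair that flags dropped or altered numbers. This toy is my own illustration of the interface; the detectors in the paper are engineered per error class. Note how it satisfies the three concerns above: it fires at the instance level, needs no reference translation, and is independent of any particular model's error distribution.

```python
# A sketch of a SALTED-style detector: a boolean predicate over a
# (source, translation) pair, tuned for very high precision.
# This toy flags translations whose numbers differ from the source's;
# it is an illustration of the interface, not a detector from the paper.
import re

def number_mismatch_detector(source: str, translation: str) -> bool:
    """Return True iff the error condition holds: the multisets of
    numbers in the source and the translation are not identical."""
    def numbers(text: str) -> list:
        return sorted(re.findall(r"\d+(?:[.,]\d+)?", text))
    return numbers(source) != numbers(translation)
```

Because numbers rarely change legitimately under translation, a check like this stays high-precision; detectors for entities, units, or negation need more linguistic machinery to reach the same bar.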

Finally, I should mention that, owing to the medium, I have presented some of the ideas above with less rigor than they deserve; please refer to the references below for a more detailed exposition.


[1] Raunak, et al. The Curious Case of Hallucinations in Neural Machine Translation. NAACL 2021.
[2] Raunak, et al. SALTED: A Framework for SAlient Long-Tail Translation Error Detection. EMNLP 2022.
[3] Raunak, et al. Finding Memo: Extractive Memorization in Constrained Sequence Generation. EMNLP 2022.
[4] Raunak, et al. Operationalizing Specifications for Evaluating Generative Models. HEGM-NeurIPS’22.