publications

2024

On Instruction-Finetuning Neural Machine Translation Models Vikas Raunak, Roman Grundkiewicz, Marcin Junczys-Dowmunt Ninth Conference on Machine Translation (WMT) [ abstract ] [ paper ] [ slides ]
In this work, we introduce instruction finetuning for Neural Machine Translation (NMT) models, which distills instruction following capabilities from Large Language Models (LLMs) into orders-of-magnitude smaller NMT models. Our instruction-finetuning recipe for NMT models enables customization of translations for a limited but disparate set of translation-specific tasks. We show that NMT models are capable of following multiple instructions simultaneously and demonstrate capabilities of zero-shot composition of instructions. We also show that through instruction finetuning, traditionally disparate tasks such as formality-controlled machine translation, multi-domain adaptation as well as multi-modal translations can be tackled jointly by a single instruction finetuned NMT model, at a performance level comparable to LLMs such as GPT-3.5-Turbo. To the best of our knowledge, our work is among the first to demonstrate the instruction-following capabilities of traditional NMT models, which allows for faster, cheaper and more efficient serving of customized translations.
SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window Vikas Raunak, Tom Kocmi, Matt Post Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) [ abstract ] [ paper ]
Reference-based metrics that operate at the sentence-level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. In this paper, we investigate whether additional source context can effectively substitute for a reference. We present a metric named SLIDE (SLIding Document Evaluator), which operates on blocks of sentences. SLIDE leverages a moving window that slides over each document in the test set, feeding each chunk of sentences into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-base metrics. This suggests that source context may provide the same information as a human reference in disambiguating source ambiguities. This finding is especially pertinent for reference-free document-level evaluation, wherein SLIDE could provide higher-quality pairwise system assessments while only requiring document boundary annotations.

2023

Dissecting In-Context Learning of Translations in Large Language Models Vikas Raunak, Hany Hassan Awadalla, Arul Menezes Findings of the Association for Computational Linguistics: Empirical Methods in Natural Language Processing [ abstract ] [ paper ]
Most of the recent work in leveraging Large Language Models (LLMs) such as GPT-3 for Machine Translation (MT) through in-context learning of translations has focused on selecting the few-shot demonstration samples. In this work, we characterize the robustness of LLMs from the GPT family to certain perturbations on few-shot translation demonstrations as a means to dissect the in-context learning of translations. In particular, we try to better understand the role of demonstration attributes for the in-context learning of translations through perturbations of high-quality, in-domain demonstrations. We find that asymmetric perturbation of the source-target mappings yield vastly different results. Further, we show that the perturbation of the source side has surprisingly little impact, while target perturbation can drastically reduce translation quality, suggesting that it is the output text distribution that provides the most important learning signal during in-context learning of translations. Based on our findings, we propose a method named Zero-Shot-Context to add this signal automatically in Zero-Shot prompting. Our proposed method greatly improves upon the zero-shot translation performance of GPT-3, thereby making it competitive with few-shot prompted translations.
Leveraging GPT-4 for Automatic Translation Post-Editing Vikas Raunak, Amr Sharaf, Hany Hassan Awadalla, Arul Menezes Findings of the Association for Computational Linguistics: Empirical Methods in Natural Language Processing (EMNLP) [ abstract ] [ paper ]
While Neural Machine Translation (NMT) represents the leading approach to Machine Translation (MT), the outputs of NMT models still require translation post-editing to rectify errors and enhance quality under critical settings. In this work, we formalize the task of direct translation post-editing with Large Language Models (LLMs) and explore the use of GPT-4 to automatically post-edit NMT outputs across several language pairs. Our results demonstrate that GPT-4 is adept at translation post-editing, producing meaningful and trustworthy edits to translations that help improve its general quality as well as remove different classes of major errors in translations. In particular, human evaluations on assessing edit trustworthiness show that GPT-4 exhibits a large improvement over the prior state-of-the-art LLM. Notably, we improve upon state-of-the-art performance on WMT-22 English-Chinese, English-German, Chinese-English and German-English language pairs using GPT-4 based post-editing, as evaluated by state-of-the-art MT quality metrics. However, we also show that GPT-4 could produce hallucinated edits, thereby urging caution in its use as an expert translation post-editor.
Evaluating Metrics for Document-context Evaluation in Machine Translation Vikas Raunak, Tom Kocmi, Matt Post Eight Conference on Machine Translation (WMT) [ abstract ] [ paper ]
Reference-based metrics that operate at the sentence level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. We investigate whether additional source context can effectively substitute for a reference. We present a metric, SLIDE (SLiding Document Evaluator), which operates on blocks of sentences using a window that slides over each document in the test set, feeding each chunk into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-base metrics. This suggests that source context may provide the same information as a human reference.
Do GPTs Produce Less Literal Translations? Vikas Raunak, Arul Menezes, Matt Post, Hany Hassan Awadalla The 61st Annual Meeting of the Association for Computational Linguistics (ACL) [ abstract ] [ paper ] [ code ]
Large Language Models (LLMs) such as GPT-3 have emerged as general-purpose language models capable of addressing many natural language generation or understanding tasks. On the task of Machine Translation (MT), multiple works have investigated few-shot prompting mechanisms to elicit better translations from LLMs. However, there has been relatively little investigation on how such translations differ qualitatively from the translations generated by standard Neural Machine Translation (NMT) models. In this work, we investigate these differences in terms of the literalness of translations produced by the two systems. Using literalness measures involving word alignment and monotonicity, we find that translations out of English (E-X) from GPTs tend to be less literal, while exhibiting similar or better scores on MT quality metrics. We demonstrate that this finding is borne out in human evaluations as well. We then show that these differences are especially pronounced when translating sentences that contain idiomatic expressions.
How Good are GPT Models at Machine Translation? A Comprehensive Evaluation Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, Hany Hassan Awadalla [ abstract ] [ paper ] [ code ]
In this work, we present a comprehensive evaluation of GPT models for machine translation, covering various aspects such as quality of different GPT models in comparison with state-of-the-art research and commercial systems, effect of prompting strategies, robustness towards domain shifts and document-level translation, all accompanied with an extensive analysis of the differential aspects of translations produced by GPT. We experiment with 18 different translation directions involving high and low resource languages, as well as non English-centric translations, and evaluate the performance of three GPT models: ChatGPT, GPT3.5 (text-davinci-003), and text-davinci-002. We also show that hybrid approaches, which combine GPT models with other translation systems, can further enhance the translation quality.

2022

Rank-One Editing of Encoder-Decoder Models Vikas Raunak, Arul Menezes 2nd Workshop on Interactive Learning for Natural Language Processing (InterNLP), NeurIPS [ abstract ] [ paper ]
Large sequence to sequence models for tasks such as Neural Machine Translation (NMT) are usually trained over hundreds of millions of samples. However, training is just the origin of a model’s life-cycle. Real-world deployments of models require further behavioral adaptations as new requirements emerge or shortcomings become known. Typically, in the space of model behaviors, behavior deletion requests are addressed through model retrainings whereas model finetuning is done to address behavior addition requests, both procedures being instances of data-based model intervention. In this work, we present a preliminary study investigating rank-one editing as a direct intervention method for behavior deletion requests in encoder-decoder transformer models. We propose four editing tasks for NMT and show that the proposed editing algorithm achieves high efficacy, while requiring only a single instance of positive example to fix an erroneous (negative) model behavior.
Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models Vikas Raunak, Matt Post, Arul Menezes NeurIPS Workshop on Human Evaluation of Generative Models (HGEM) [ abstract ] [ paper ] [ talk ]
In this work, we present some recommendations on the evaluation of state-of-the-art generative models for constrained generation tasks. The progress on generative models has been rapid in recent years. These large-scale models have had three impacts: firstly, the fluency of generation in both language and vision modalities has rendered common average-case evaluation metrics much less useful in diagnosing system errors. Secondly, the same substrate models now form the basis of a number of applications, driven both by the utility of their representations as well as phenomena such as in-context learning, which raise the abstraction level of interacting with such models. Thirdly, the user expectations around these models and their feted public releases have made the technical challenge of out of domain generalization much less excusable in practice. Subsequently, our evaluation methodologies haven’t adapted to these changes. More concretely, while the associated utility and methods of interacting with generative models have expanded, a similar expansion has not been observed in their evaluation practices. In this paper, we argue that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted and provide recommendations for the same. Our recommendations are based on leveraging specifications as a powerful instrument to evaluate generation quality and are readily applicable to a variety of tasks.
Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks Vikas Raunak, Arul Menezes Findings of the Association for Computational Linguistics: Empirical Methods in Natural Language Processing [ abstract ] [ paper ] [ code ] [ talk ]
Memorization presents a challenge for several constrained Natural Language Generation (NLG) tasks such as Neural Machine Translation (NMT), wherein the proclivity of neural models to memorize noisy and atypical samples reacts adversely with the noisy (web crawled) datasets. However, previous studies of memorization in constrained NLG tasks have only focused on counterfactual memorization, linking it to the problem of hallucinations. In this work, we propose a new, inexpensive algorithm for extractive memorization (exact training data generation under insufficient context) in constrained sequence generation tasks and use it to study extractive memorization and its effects in NMT. We demonstrate that extractive memorization poses a serious threat to NMT reliability by qualitatively and quantitatively characterizing the memorized samples as well as the model behavior in their vicinity. Based on empirical observations, we develop a simple algorithm which elicits non-memorized translations of memorized samples from the same model, for a large fraction of such samples. Finally, we show that the proposed algorithm could also be leveraged to mitigate memorization in the model through finetuning.
SALTED: A Framework for SAlient Long-Tail Translation Error Detection. Vikas Raunak, Matt Post, Arul Menezes Findings of the Association for Computational Linguistics: Empirical Methods in Natural Language Processing [ abstract ] [ paper ] [ talk ] [ slides ]
Traditional machine translation (MT) metrics provide an average measure of translation quality that is insensitive to the long tail of behavioral problems in MT. Examples include translation of numbers, physical units, dropped content and hallucinations. These errors, which occur rarely and unpredictably in Neural Machine Translation (NMT), greatly undermine the reliability of state-of-the-art MT systems. Consequently, it is important to have visibility into these problems during model development. Towards this direction, we introduce SALTED, a specifications-based framework for behavioral testing of MT models that provides fine-grained views of salient long-tail errors, permitting trustworthy visibility into previously invisible problems. At the core of our approach is the development of high-precision detectors that flag errors (or alternatively, verify output correctness) between a source sentence and a system output. We demonstrate that such detectors could be used not just to identify salient long-tail errors in MT systems, but also for higher-recall filtering of the training data, fixing targeted errors with model fine-tuning in NMT and generating novel data for metamorphic testing to elicit further bugs in models.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model BigScience Workshop, Vikas Raunak, et al. Under Submission [ abstract ] [ paper ] [ code ]
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code. Sebastian Gehrmann, Vikas Raunak, et al. Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations [ abstract ] [ paper ] [ code ]
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each others work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Aarohi Srivastava, Vikas Raunak, et al. Transactions on Machine Learning Research (TMLR) [ abstract ] [ paper ] [ code ]
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 444 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI’s GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit “breakthrough” behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting

2021

NL-Augmenter: A framework for task-sensitive natural language augmentation. Kaustubh Dhole, Vikas Raunak, et al. Northern European Journal of Language Technology (NEJLT) [ abstract ] [ paper ] [ code ]
Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we presentNL-Augmenter, a new participatory Pythonbased natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its tranformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robutstness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).
The Curious Case of Hallucinations in Neural Machine Tanslation. Vikas Raunak, Arul Menezes, Marcin Junczys-Dowmunt Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) [ abstract ] [ paper ] [ code ] [ talk ] [ slides ] [ poster ]
In this work, we study hallucinations in Neural Machine Translation (NMT), which lie at an extreme end on the spectrum of NMT pathologies. Firstly, we connect the phenomenon of hallucinations under source perturbation to the Long-Tail theory of Feldman (2020), and present an empirically validated hypothesis that explains hallucinations under source perturbation. Secondly, we consider hallucinations under corpus-level noise (without any source perturbation) and demonstrate that two prominent types of natural hallucinations (detached and oscillatory outputs) could be generated and explained through specific corpus-level noise patterns. Finally, we elucidate the phenomenon of hallucination amplification in popular data-generation processes such as Backtranslation and sequence-level Knowledge Distillation.
Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks. Siddharth Dalmia, Brian Yan, Vikas Raunak, Florian Metze, Shinji Watanabe Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) [ abstract ] [ paper ] [ code ] [ slides ] [ poster ]
End-to-end approaches for sequence tasks are becoming increasingly popular. Yet for complex sequence tasks, like speech translation, systems that cascade several models trained on sub-tasks have shown to be superior, suggesting that the compositionality of cascaded systems simplifies learning and enables sophisticated search capabilities. In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks. These hidden intermediates can be improved using beam-search to enhance the overall performance and also can incorporate external models at intermediate stages of the network to re-score or adapt towards out-of-domain data. One instance of the proposed framework is a Multi-Decoder model for speech translation that extracts the searchable hidden intermediates from the speech recognition sub-task. The model demonstrates the aforementioned benefits and outperforms the previous state-of-the-art by around +6 and +3 BLEU score on the two test sets of Fisher-CallHome benchmark.
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics Sebastian Gehrmann, Vikas Raunak, et al. 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM) [ abstract ] [ paper ] [ code ] [ media ]
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. However, due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of corpora and evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the initial release for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.

2020

On Long-Tailed Phenomena in Neural Machine Translation. Vikas Raunak, Siddharth Dalmia, Vivek Gupta, Florian Metze Findings of the Association for Computational Linguistics: Empirical Methods in Natural Language Processing [ abstract ] [ paper ] [ code ] [ talk ] [ slides ] [ poster ]
State-of-the-art Neural Machine Translation (NMT) models struggle with generating low-frequency tokens, tackling which remains a major challenge. The analysis of long-tailed phenomena in the context of structured prediction tasks is further hindered by the added complexities of search during inference. In this work, we quantitatively characterize such long-tailed phenomena at two levels of abstraction, namely, token classification and sequence generation. We propose a new loss function, the Anti-Focal loss, to better adapt model training to the structural dependencies of conditional text generation by incorporating the inductive biases of beam search in the training process. We show the efficacy of the proposed technique on a number of Machine Translation (MT) datasets, demonstrating that it leads to significant gains over cross-entropy across different language pairs, especially on the generation of low-frequency words. We have released the code to reproduce our results.
Ranking Clarification Questions via Natural Language Inference. Vikas Raunak, Vaibhav Kumar, Jamie Callan 29th ACM International Conference on Information and Knowledge Management (CIKM) [ abstract ] [ paper ] [ talk ] [ slides ]
Teaching machines to ask clarifying questions, given a natural language query is of immense utility in practical natural language processing systems, since such interaction could help in filling information gaps for better machine comprehension of the query. For the task of ranking clarification questions, we hypothesize that determining whether a clarification question pertains to a missing entry in a given post (on QA forums such as StackExchange) could be considered as a special case of Natural Language Inference (NLI), where both the post and the most relevant clarification question point to a shared latent piece of information or context. We validate this hypothesis by incorporating representations from a Siamese BERT model fine-tuned on NLI and Multi-NLI datasets into our models and demonstrate that our best performing model obtains a relative performance improvement of 40 percent and 60 percent respectively (on the key metric of Precision@1), over the state-of-the-art baseline(s) on the two evaluation sets of the StackExchange dataset, thereby, significantly surpassing the state-of-the-art.
On Dimensional Linguistic Properties of the Word Embedding Space. Vikas Raunak, Vaibhav Kumar, Vivek Gupta, Florian Metze 5th Workshop on Representation Learning for NLP (RepL4NLP) [ abstract ] [ paper ] [ code ] [ talk ] [ slides ]
Word embeddings have become a staple of several natural language processing tasks, yet much remains to be understood about their properties. In this work, we analyze word embeddings in terms of their principal components and arrive at a number of novel and counterintuitive observations. In particular, we characterize the utility of variance explained by the principal components as a proxy for downstream performance. Furthermore, through syntactic probing of the principal embedding space, we show that the syntactic information captured by a principal component does not correlate with the amount of variance it explains. Consequently, we investigate the limitations of variance based embedding post-processing algorithms and demonstrate that such post-processing is counter-productive in sentence classification and machine translation tasks. Finally, we offer a few precautionary guidelines on applying variance based embedding post-processing and explain why non-isotropic geometry might be integral to word embedding performance.

2019

On Compositionality in Neural Machine Translation. Vikas Raunak, Vaibhav Kumar, Florian Metze Context and Compositionality in Biological and Artificial Neural Systems Workshop, NeurIPS [ abstract ] [ paper ] [ slides ] [ poster ]
We investigate two specific manifestations of compositionality in Neural Machine Translation (NMT) : (1) Productivity - the ability of the model to extend its predictions beyond the observed length in training data and (2) Systematicity - the ability of the model to systematically recombine known parts and rules. We evaluate a standard Sequence to Sequence model on tests designed to assess these two properties in NMT. We quantitatively demonstrate that inadequate temporal processing, in the form of poor encoder representations is a bottleneck for both Productivity and Systematicity. We propose a simple pre-training mechanism which alleviates model performance on the two properties and leads to a significant improvement in BLEU scores.
On Leveraging the Visual Modality for Neural Machine Translation. Vikas Raunak, Sang Keun Choe, Quanyang Lu, Yi Xu, Florian Metze The 12th International Conference on Natural Language Generation (INLG) [ abstract ] [ paper ] [ slides ] [ poster ]
Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics. Recently, Caglayan et al. posit that the observed gains are limited mainly due to the very simple, short, repetitive sentences of the Multi30k dataset (the only multimodal MT dataset available at the time), which renders the source text sufficient for context. In this work, we further investigate this hypothesis on a new large scale multimodal Machine Translation (MMT) dataset, How2, which has 1.57 times longer mean sentence length than Multi30k and no repetition. We propose and evaluate three novel fusion techniques, each of which is designed to ensure the utilization of visual context at different stages of the Sequence-to-Sequence transduction pipeline, even under full linguistic context. However, we still obtain only marginal gains under full linguistic context and posit that visual embeddings extracted from deep vision models (ResNet for Multi30k, ResNext for How2) do not lend themselves to increasing the discriminativeness between the vocabulary elements at token level prediction in NMT. We demonstrate this qualitatively by analyzing attention distribution and quantitatively through Principal Component Analysis, arriving at the conclusion that it is the quality of the visual embeddings rather than the length of sentences, which need to be improved in existing MMT datasets.
On Leveraging Visual Modality for ASR Error Correction Vikas Raunak, Sang Keun Choe, Quanyang Lu, Yi Xu, Florian Metze Workshop on The How2 Challenge: New Tasks for Vision and Language, ICML [ abstract ] [ paper ]
We present our recent efforts on leveraging visual modality for automated speech recognition (ASR) error correction. A visually grounded attentionbased Sequence-to-Sequence (S2S) model is trained to correct contextual and functional word errors in transcripts/outputs of unimodal ASR systems. Specifically, our error correction model address the problem of semantic gap in multimodal fusion, which allows high-level visual features to be combined with comparably high-level text features. Visual-semantic joint embedding and language model are used to rescore the n-best list output by the error correction model. Tested on the How2 dataset, visually grounded error correction led to only marginal improvements over the unimodal ASR system. We provide error analysis on the output of visually grounded ASR error correction model, and propose a potential solution based on the analysis.
Learned in Speech Recognition: Contextual Acoustic Word Embeddings. Vikas Raunak, Shruti Palaskar, Florian Metze IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) [ abstract ] [ paper ] [ slides ]
End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon. In addition, word models may also be easier to integrate with downstream tasks such as spoken language understanding, because inference (search) is much simplified compared to phoneme, character or any other sort of sub-word units. In this paper, we describe methods to construct contextual acoustic word embeddings directly from a supervised sequence-to-sequence acoustic-to-word speech recognition model using the learned attention distribution. On a suite of 16 standard sentence evaluation tasks, our embeddings show competitive performance against a word2vec model trained on the speech transcriptions. In addition, we evaluate these embeddings on a spoken language understanding task, and observe that our embeddings match the performance of text-based embeddings in a pipeline of first performing speech recognition and then constructing wordembeddings from transcriptions.
Effective Dimensionality Reduction for Word Embeddings Vikas Raunak, Vivek Gupta, Florian Metze 4th Workshop on Representation Learning for NLP (RepL4NLP) [ abstract ] [ paper ] [ code ]
Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sentences, paragraphs and documents. Recently, there has been an emphasis on improving the pretrained word vectors through post-processing algorithms. One improvement area is reducing the dimensionality of word embeddings. Reducing the size of word embeddings can improve their utility in memory constrained devices, benefiting several real world applications. In this work, we present a novel technique that efficiently combines PCA based dimensionality reduction with a recently proposed post-processing algorithm (Mu and Viswanath, 2018), to construct effective word embeddings of lower dimensions. Empirical evaluations on several benchmarks show that our algorithm efficiently reduces the embedding size while achieving similar or (more often) better performance than original embeddings. We have released the source code along with this paper.