Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know… – Donald Rumsfeld

The above quote is famously known as “there are unknown unknowns”, and it offers a surprisingly succinct way to think about risk. I am going to use it to organize reliability challenges into a few loose but useful categories. This post is a follow-up to my previous post, but is geared more towards a general, higher-level exposition on tackling reliability challenges in AI systems.

What is reliability of a system?

The reliability of a system (e.g., an automobile) can be characterized as measurable properties (e.g., number of repairs per mile) that quantify its adherence to certain specifications (e.g., the automobile should work!). System A is more reliable than System B if System A produces behavior closer to the specifications. Reliability is critical for real-world utility – it is hard to trust an automobile for any important transit requirements if it breaks down all the time. The meaning of reliability depends on the specifications – a tourist balloon is more reliable if it can withstand pin pricks, while a birthday balloon is more reliable if it bursts readily when pricked. Further, reliability encompasses both the correctness of system behavior (output) and the system’s robustness around an existing correct behavior (output). The problem of alignment is the problem of encoding hard-to-specify behaviors in AI systems, in a manner that is consistent with human values. System A is more aligned than System B if it produces behavior more consistent with human values. Alignment and reliability are thereby closely related – the methods for making a system more reliable are really the same methods that make it more aligned, assuming you want reliability with respect to specifications that encode human values. This post is about creating an informal typology for thinking about reliability problems in LLMs (or AI systems that use LLMs), based on my own experience improving the reliability of machine translation and speech translation systems.

How to categorize the risk of a system being unreliable?

If we break down the risks of a system being unreliable, such events can be categorized into 3 categories:

  1. The “Unknown Knowns”: These are known failure modes which we don’t know how to model or avoid (e.g., arithmetic with LLMs). This is the realm where most of the iterative research happens.
  2. The “Known Unknowns”: These are known failure modes at unknown places – the long tail. This is the area where highly specific supervision is the only (?) approach to improve reliability, unless something like the mythical Q* comes along as a general solution to systematically probe the vast areas of the input space.
  3. The “Unknown Unknowns”: These are the actual safety risks – failure modes so unknown, we don’t even know where to look.

What are the empirical principles of ensuring reliability?

As general principles for tackling problems in each of the 3 categories above, I think the following directions are useful:

  1. Tackling “Unknown Knowns”: Building causal understanding of model failure modes.
  2. Tackling “Known Unknowns”: Algorithmic long-tailed error enumeration, measurement and localization.
  3. Tackling “Unknown Unknowns”: Principled scaling of large-surface automatic evaluation.

Executing on these three directions has yielded significant benefits for improving the reliability of Neural Machine Translation (NMT) models, and I believe these principles will transfer well to improving LLM reliability. I elaborate more on these ideas below.

The Unknown Known: Causal Understanding & Characterization of Failure Modes

In my work on hallucinations in NMT, I demonstrated through simulation experiments how different hallucinations emerge. These experiments tested potential hypotheses regarding the origins of hallucinations (memorization and dataset noise). The exploration not only explained the interaction between data and optimization that causes such hallucinations, but in turn offered mechanisms to debug and fix these errors when they are observed in real data. An applied goal of causal understanding of failure modes is to build mechanisms that ensure a graceful degradation of application output under such failures. These mechanisms, whether defined over the inputs, the outputs or over model artifacts (e.g., sample probabilities, attention or any intrinsic failure correlate), essentially serve as guardrails when serving applications to general users. With the wide scope of LLM integrations within different computing systems, these reliability guardrails could become a critical differentiating factor for products as well. Another goal related to understanding failure modes is uncertainty quantification of model outputs (and of when they break), which will become critical in handling the long tail (so that the model can abstain).
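
To make the guardrail idea concrete, here is a minimal sketch that abstains when an intrinsic failure correlate – the average token log-probability of the output – falls below a threshold. The `model.translate` interface, the threshold value, and the choice of correlate are illustrative assumptions, not a prescription; in practice both the correlate and the threshold would be calibrated against a labelled set of known failures.

```python
# Hypothetical guardrail sketch: degrade gracefully (abstain) when an intrinsic
# failure correlate suggests the output is unreliable. Here the correlate is
# the average token log-probability of the generated output.

LOGPROB_THRESHOLD = -2.5  # assumed value; tune on held-out failure data


def guarded_translate(model, source_text):
    """Return the model output, or an abstention signal if the output looks unreliable."""
    # `model.translate` is a stand-in for whatever inference API is in use;
    # it is assumed to return the output tokens and their log-probabilities.
    output_tokens, token_logprobs = model.translate(source_text)

    avg_logprob = sum(token_logprobs) / max(len(token_logprobs), 1)

    if avg_logprob < LOGPROB_THRESHOLD:
        # Graceful degradation: surface an abstention instead of a likely hallucination.
        return {"status": "abstained", "reason": "low sequence confidence",
                "avg_logprob": avg_logprob}

    return {"status": "ok", "output": output_tokens, "avg_logprob": avg_logprob}
```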

The Known Unknown: Algorithmically Enumerating and Measuring the Long-Tail

Precision measurement is of foundational importance to scientific and engineering endeavors. The greatest challenge in tackling neural models’ failure modes and behavioral problems under this category is measurement itself. Once a reliable, automatic measurement is designed, multiple lines of experiments can be iterated against it. In general, even though LLMs are best conceptualized as abstractions for reasoning (through in-context learning), their locked-in parametric world knowledge is what makes them loosely self-contained as an application in themselves (when combined with instruction finetuning). Knowledge-seeking queries that draw on this parametric world knowledge likely have different error-distribution characteristics: long-tailed items could be handled relatively well while frequent items could suffer from conflated knowledge representations, in contrast to typical long-tailed behavior. As such, LLMs might have at least two separate long tails of vulnerabilities along these two dimensions (knowledge vs. task-specific), and enumerating these error distributions will be important in making them more reliable.
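
A minimal sketch of what such enumeration could look like is below: evaluation items are bucketed along a frequency proxy, and error rates are measured per bucket to localize where reliability falls off. The frequency proxy, the bucket boundaries and the automatic correctness check (`is_correct`) are all assumptions for illustration; obtaining good versions of either the proxy or the check is itself a hard sub-problem.

```python
from collections import defaultdict

def bucket_by_frequency(count):
    """Map a raw frequency proxy to a coarse bucket label (assumed cutoffs)."""
    if count == 0:
        return "unseen"
    if count < 10:
        return "rare"
    if count < 1000:
        return "medium"
    return "frequent"


def error_profile(eval_items, is_correct):
    """Return per-bucket error rates to localize where the long tail breaks.

    eval_items: iterable of (item, frequency_proxy) pairs
    is_correct: callable item -> bool, an automatic correctness measurement
    """
    totals, errors = defaultdict(int), defaultdict(int)
    for item, freq in eval_items:
        bucket = bucket_by_frequency(freq)
        totals[bucket] += 1
        if not is_correct(item):
            errors[bucket] += 1
    # Error rate per bucket, e.g. {"unseen": 0.4, "frequent": 0.05, ...}
    return {b: errors[b] / totals[b] for b in totals}
```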

The Unknown Unknown: Principled Scaling of Large-Surface Automatic Evaluations

Scaling up automatic evaluation is a key challenge with modern LLMs. Typically, the community moves by acting on the simple principles of observation and abstraction, iterating towards an automated measure based on static benchmarks. However, this needs to change as LLMs enter critical application spaces – synthetic-data based evaluation, when the target measurement is unambiguous, is a promising direction for expanding the scope of evaluation.
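
As a hedged illustration of this direction, the sketch below generates evaluation items whose correct answers are known by construction, so scoring reduces to unambiguous exact match. Simple arithmetic serves purely as a stand-in task, and the `query_model` callable is a placeholder assumption for whatever inference API is available.

```python
import random

def make_arithmetic_item(rng):
    """Generate one synthetic item with a known-correct answer (stand-in task)."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} + {b}? Answer with the number only.", str(a + b)


def evaluate(query_model, n_items=1000, seed=0):
    """Return exact-match accuracy on programmatically generated items."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        prompt, answer = make_arithmetic_item(rng)
        prediction = query_model(prompt).strip()
        correct += (prediction == answer)
    return correct / n_items
```

Because the inputs are generated rather than curated, the evaluation surface can be scaled arbitrarily and re-sampled to avoid the staleness and contamination issues of static benchmarks.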