April 7, 2023
Volume 21, issue 1


Designing a Framework for Conversational Interfaces

Combining the latest advances in machine learning with earlier approaches

Zachary Tellman

The conversational interface is an idea that is forever on the cusp of transforming the world. The potential is undeniable: Everyone has innate, untapped conversational expertise. We could do away with the nested menus required by visual interfaces; anything the user can name is immediately at hand. We could turn natural language into a declarative scripting language and operating systems into IDEs (integrated development environments).

Reality, however, has not lived up to this potential. Most people's use of the conversational agents in their phones and smart devices is limited to reminders and timers—if they use them at all. Semantic Machines is creating a framework for conversational interfaces that is meant to unlock some of this potential. It's currently powering a conversational interface in Outlook Mobile, with other products soon to follow.

To accomplish this, the Semantic Machines framework combines some of the latest advances in machine learning with concepts and approaches dating back to the earliest days of artificial intelligence research. To understand why, let's look back 50 years to one of the first—and still one of the most successful—conversational agents ever created.

A Detour Through History

In 1972, Terry Winograd published a paper ("Understanding Natural Language," Cognitive Psychology 3(1), 1–191) that described a software project he had worked on in the late 1960s. It allowed users to direct a virtual robot arm, named SHRDLU, to interact with a world consisting of a table, a box, and a few blocks of varying shapes and colors. Users could carry on a conversation with SHRDLU, asking questions and giving instructions:

- Pick up a big red block.
OK.

- Find a block which is taller than the one you are holding and put it into the box.
BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
OK.

- What does the box contain?
THE BLUE PYRAMID AND THE BLUE BLOCK.

- How many blocks are not in the box?
FOUR OF THEM.

- Is at least one of them narrower than the one which I told you to pick up?
YES, THE RED CUBE.

- Is it supported?
YES, BY THE TABLE.

- Can the table pick up blocks?
NO.

- Can a pyramid be supported by a block?
YES.

- Can a pyramid support a pyramid?
I DON'T KNOW.

- Stack up two pyramids.
(Trying) I CAN'T.

Winograd's project represents a pivotal point in the history of AI research. Earlier efforts were significantly more ambitious; Simon, Shaw, and Newell's General Problem Solver (GPS), a computer program introduced in 1958, was presented not just as a method for achieving human-like behavior, but also as a descriptive model for human cognition. As became the norm for early AI research, they reduced the problem to one of search. Given an initial state and a desired end state, GPS would search through all possible sequences of actions until it found one that led to that end state. Since the branching factor of the search tree would be very high—you can, in most situations, do almost anything—GPS would need to use heuristics (from the Greek heureka, as in "I've found it!") to determine which actions were likely to be useful in a given situation.

With this research describing the engine for thought, all that remained was knowledge engineering: creating a repository of possible actions and relevant heuristics for all aspects of human life. This, unfortunately, proved harder than expected. As various knowledge engineering projects stalled, researchers focused on problem-solving within "microworlds": virtual environments where the state was easily represented and the possible actions easily enumerated. Winograd's microworld was the pinnacle of these efforts; SHRDLU's mastery of its environment, and the subset of the English language that could be used to describe it, were self-evident.

Still, it wasn't clear how to turn a microworld into something more useful; the boundaries of SHRDLU's environment were relied upon at every level of its implementation. Hubert Dreyfus, professor of philosophy and leading critic of early AI research, characterized these projects as "ad hoc solutions [for] cleverly chosen problems, which give the illusion of complex intellectual activity." Ultimately, Dreyfus was proven right; every attempt to generalize or stitch together these projects failed.

What came next is a familiar story: Funding for research dried up in the mid-1970s, marking the beginning of the AI winter. After some failed attempts in the 1980s to commercialize past research by selling so-called "expert systems," the field lay dormant for decades before the resurgence of the statistical techniques generally referred to as machine learning.

Generally, this early era in AI research is seen as a historical curiosity; a group of researchers made wildly optimistic predictions about what they could achieve and failed. What could they possibly have to teach us? Surely, it's better to look forward to the bleeding edge of research than back at these abandoned microworlds.

We must acknowledge, however, the astonishing sophistication of Winograd's SHRDLU when compared with modern conversational agents. These agents operate on a model called "slots and intents," which is effectively Mad Libs in reverse. Given some text from the user (the utterance), the system identifies the corresponding template (the intent) and then extracts pieces of the utterance (the slots). These pieces are then fed into a function that performs the task associated with the intent.

Take, for example, a function order_pizza(size, toppings). A slots-and-intents framework could easily provide a mapping between "Order me a medium pizza with pepperoni and mushrooms" and order_pizza("medium", ["pepperoni", "mushrooms"]). It allows the linguistic concerns to be separated from the actual business logic required to order a pizza.
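The mapping just described can be sketched in a few lines of Python. This is only an illustration of the slots-and-intents idea, not how any particular framework works: the regular-expression template, the parse_toppings helper, and the handle function are assumptions made for this example, and a production system would use a trained classifier rather than a hand-written pattern.

```python
import re

# Hypothetical intent template: the pattern is the "intent",
# and its named groups are the "slots".
ORDER_PIZZA = re.compile(
    r"order me an? (?P<size>small|medium|large) pizza with (?P<toppings>.+)",
    re.IGNORECASE,
)


def order_pizza(size, toppings):
    # Stand-in for the business logic; it only echoes what it was given.
    return {"size": size, "toppings": toppings}


def parse_toppings(text):
    # Split on commas (with an optional "and") or on a bare "and".
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", text.strip())
    return [part for part in parts if part]


def handle(utterance):
    # Returns None when the utterance does not fit the single known template.
    match = ORDER_PIZZA.fullmatch(utterance.strip().rstrip("."))
    if match is None:
        return None
    return order_pizza(match["size"].lower(), parse_toppings(match["toppings"]))


order = handle("Order me a medium pizza with pepperoni and mushrooms")
unknown = handle("Find a block which is taller than the one you are holding")
```

The second call returns None because the utterance fits no template, which is the brittleness examined in the rest of this section.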

But consider the second utterance from the conversation with SHRDLU:

- Find a block which is taller than the one you are holding and put it into the box.

This utterance is difficult to model as an intent for several reasons. It describes two actions, but since every intent maps onto a single function, you would have to define a compound function

find_block_and_put_into_box(...)

and define similar functions for any other compound action you would want to support.

But even that's not enough; by simply calling

find_block_and_put_into_box("taller than the one you are holding")

you're letting linguistic concerns bleed into the business logic. At most, you would want the business logic to interpret individual words such as taller, narrower, and so on, but that would require an even more specific function:

find_block_which_is_X_than_held_block_and_put_in_box("taller")

The problem is that natural language is compositional, while slots-and-intents frameworks are not. Rather than defining a set of primitives ("find a block," "taller than," "held block," etc.) that can be freely combined, the developers must enumerate each configuration of these primitives they wish to support. In practice, this leads to conversational agents that are narrowly focused and easily confused.

Winograd's SHRDLU, despite its limitations, was far more flexible. At Semantic Machines, we are building a dialogue system that will preserve that flexibility, while avoiding most of the limitations.

Plans

In the Semantic Machines dialogue system, utterances are translated into small programs, which for historical reasons are called plans. Given the problematic utterance:

- Find a block which is taller than the one you are holding and put it into the box

Our planning model, which is a Transformer-based encoder-decoder neural network, will return something like this:

find_block( (b: Block) => taller_than(b, held_block()) )
put_in_box(the[Block]())

This is rendered in Express, an in-house language that is syntactically modeled after Scala. Notice that each symbol in the plan corresponds almost one-to-one with a part of the utterance, down to a special the() function, which resolves what "it" refers to. This is because the planning model is meant only to translate the utterance, not interpret it.
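To make the shape of such a plan concrete, here is a minimal Python sketch of a blocks world in which the same plan can be executed. Express itself is not public, so everything here is an assumption for illustration: the BlocksWorld class, its results list, and the very simple the() lookup (most recent value of the requested type) are stand-ins, not the Semantic Machines runtime.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    height: float
    in_box: bool = False


class BlocksWorld:
    def __init__(self, blocks, held):
        self.blocks = blocks
        self.held = held
        self.results = []  # values produced by earlier steps, oldest first

    def held_block(self):
        return self.held

    def taller_than(self, a, b):
        return a.height > b.height

    def find_block(self, predicate):
        block = next((b for b in self.blocks if predicate(b)), None)
        if block is None:
            raise LookupError("no block satisfies the description")
        self.results.append(block)
        return block

    def the(self, kind):
        # Simplified reference resolution: the most recent result of this type.
        for value in reversed(self.results):
            if isinstance(value, kind):
                return value
        raise LookupError("nothing of that type has been mentioned")

    def put_in_box(self, block):
        block.in_box = True
        return block


def plan(world):
    # Mirrors the plan above, one primitive per phrase of the utterance.
    world.find_block(lambda b: world.taller_than(b, world.held_block()))
    return world.put_in_box(world.the(Block))


red, green, blue = Block("red", 2.0), Block("green", 3.0), Block("blue", 1.0)
world = BlocksWorld([blue, red, green], held=red)
moved = plan(world)
```

Each primitive is defined once and can be recombined freely, so a different utterance needs a different plan but no new functions.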

The reason for this isn't immediately obvious; to most experienced developers, a function such as taller_than would seem to be an unnecessary layer of indirection. Why not just inline it?

find_block( (b: Block) => b.height > held_block().height )

This indirection, however, is valuable. In a normal codebase, function names aren't exposed; they can be assigned any meaning, so long as it makes sense to other people on the team. The functions in a plan are different: they are an interface between the system and the user, and their meaning is defined by the user's intent. Over time, that meaning is almost certain to become more nuanced. We may, for example, realize that when people say "taller than," they mean noticeably taller:

def taller_than(a: Block, b: Block) = (a.height - b.height) > HEIGHT_EPSILON

If the layer of indirection has been maintained, this is an easy one-line change to the function definition, and the training dataset for the planning model remains unchanged. If the function has been inlined, however, the training dataset must be carefully migrated; a.height > b.height should be updated only where it corresponds to "taller than" in the utterance.

Focusing on translation keeps the training data timeless, allowing the dataset to grow monotonically even as we tinker with semantics. Matching each natural language concept to a function keeps the semantics explicit and consistent. This approach, however, assumes the meaning is largely context independent. Our planning model is constrained by the language's type system, so if the utterance doesn't mention blocks, it won't use block-related functions; otherwise, you can assume that "taller than" can always be translated to taller_than.

This, of course, is untrue for pronouns such as "it," "that," or "them"; their meaning depends entirely on what was said earlier in the conversation. In the Semantic Machines system, all such references are translated into a call to the(). This is possible because the Express runtime retains the full execution, including all intermediate results, of every plan in the current conversation. This data, stored as a dataflow graph, represents the conversational context: things we've already discussed and may want to reference later. Certain special functions, such as the(), can query that graph, searching for the expression that is being referenced.

In SHRDLU, these pronouns were resolved during its parse phase, which transformed utterances into its own version of a plan. Resolution, however, is not always determined by the grammatical structure of the utterance; sometimes you need to understand its semantics. Consider these two commands:

- Put the red block beneath the green block, and the pyramid on top of it.

- Put the red block above the green block, and the pyramid on top of it.

Common sense says that the pyramid should go on whichever block is above the other. To act on this common sense, SHRDLU had to abandon any meaningful separation of syntactic and semantic analysis, which explains, in part, why it was so hard to extend. In the Semantic Machines system, resolution is driven by an entirely separate model, which uses syntactic heuristics where possible and domain-specific semantics where necessary. For most developers, however, it suffices to know that "it" and "that" translate into the().

Constraints

Notice that in the plan shown earlier, we pass find_block a predicate with the criteria for the block we wish to find:

find_block( (b: Block) => taller_than(b, held_block()) )

This is because the user hasn't revealed which block they want, but only provided the criteria for finding it. This is called intensional description, as opposed to extensional description, which specifies the actual entity or entities. In practice, every entity referenced in conversation is referenced intensionally; a reference to "Alice" would be translated into:

the[Person](p => p.name ~= "Alice")

where ~= means similar to. When executed, the() will try to find a person named Alice somewhere in the conversational history, but there's no guarantee one exists. The user may assume that, given who they are, the system can figure out who they mean. Perhaps there's a particular Alice whom they work with, or someone in their family is named Alice. In either case, the user thinks they've provided enough information, so we have to figure out what makes sense in the given context.

If the() fails to find a match in the conversational context, it will call a resolver function associated with the Person datatype. How should a Person resolver, given a user-provided predicate, actually work? We can't simply scan over a list of all the possible people and apply the predicate as a filter; that dataset lives elsewhere and is unlikely to be easily accessed. Because of both practical and privacy concerns, it will almost certainly be exposed via a service with access controls and a limited API.
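The lookup order described above (conversational context first, then the resolver for the datatype) can be sketched as follows. This is a simplified illustration under stated assumptions: the Conversation class, the similar() helper standing in for ~=, and the resolve_person stub are all hypothetical. In particular, the stub scans a local list purely so the example is self-contained, which is exactly what a real resolver usually cannot do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str


def similar(a, b):
    # Stand-in for "~=": case-insensitive equality only.
    return a.casefold() == b.casefold()


class Conversation:
    def __init__(self, resolvers):
        self.results = []           # every value produced so far, oldest first
        self.resolvers = resolvers  # hypothetical: one resolver per datatype

    def record(self, value):
        self.results.append(value)
        return value

    def the(self, kind, predicate=lambda value: True):
        # 1. Prefer the most recently mentioned value that matches.
        for value in reversed(self.results):
            if isinstance(value, kind) and predicate(value):
                return value
        # 2. Otherwise hand the predicate to the resolver for this datatype.
        resolver = self.resolvers.get(kind)
        if resolver is None:
            raise LookupError(f"nothing in context matches {kind.__name__}")
        return self.record(resolver(predicate))


CONTACTS = [Person("Bob"), Person("Alice")]
resolver_calls = []


def resolve_person(predicate):
    # Stub resolver: a real one would query a back-end service instead.
    resolver_calls.append(predicate)
    match = next((p for p in CONTACTS if predicate(p)), None)
    if match is None:
        raise LookupError("no such person")
    return match


conversation = Conversation({Person: resolve_person})
alice = conversation.the(Person, lambda p: similar(p.name, "alice"))
same_person = conversation.the(Person)  # e.g. a later "her"
```

The second call is answered from the conversational context, so the resolver runs only once.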

The resolver, then, must translate the predicate into one or more queries to back-end services that provide information about people. To do that, we must stop thinking of it as a predicate and start thinking of it as a constraint.

Many developers have likely heard of SAT (satisfiability) solvers, which, given constraints on one or more Boolean values, will try to find satisfying assignments. Given a && !b, a solver will return a == true, b == false. Given a && !a, it will report that the constraint is unsatisfiable. Since a variety of problems can be mapped into this representation, SAT solvers are widely used. This capability is generalized by SMT (satisfiability modulo theories) solvers, which can solve more complex constraints on a wider variety of data types.
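For readers who have not used one, the contract of a SAT solver can be shown with a brute-force sketch in Python. The solve function below is an assumption made for illustration; it simply enumerates every assignment, whereas real solvers use far more sophisticated search and scale to much larger problems.

```python
from itertools import product


def solve(variables, constraint):
    """Return the first satisfying assignment found, or None if there is none."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if constraint(assignment):
            return assignment
    return None


satisfiable = solve(["a", "b"], lambda v: v["a"] and not v["b"])
unsatisfiable = solve(["a"], lambda v: v["a"] and not v["a"])
```

Here satisfiable is the assignment a == true, b == false, and unsatisfiable is None because no value of a makes a && !a hold.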

Neither kind of solver, however, has a way to specify that "the value must correspond to an entity in a back-end service." Even if it did, you probably wouldn't want to use it; you don't want the solver to fire off dozens of queries for names similar to "Alice" while searching through possible values. Only the domain developer building atop the dialogue system understands the capabilities and costs of their back-end services. The query API for a service, for example, might offer its own "similar to" operator. Its similarity metric, however, probably won't reflect that some people use Misha and Mikhail interchangeably. The domain developer will have to maintain a balance between preserving the user's intent and minimizing the number of requests made per utterance.

Since we can't fully interpret the constraint for the domain developers, we must provide them their own tools for interpretation. Domain functions that, like resolvers, interpret constraints are called controllers. In the current version of the Semantic Machines system, controllers are typically written in TypeScript since that language is likely to be a familiar and expressive way to write complex domain logic. Within the controller, predicates are transformed into constraint zippers, which allow developers to traverse, query, and transform constraints on complex data types. For each field and subfield, domain developers can ask various questions: Are there lower or upper bounds? What is an example of a satisfying value? Is that the only satisfying value? Does this value satisfy the constraint?

This last question is crucial, because encoding the entire constraint in the query to the back-end service won't always be possible. The set of results that come back may be too broad and therefore must be post-filtered using the constraint. Conversely, operators that correspond to query operators in the service's API, such as ~=, can be configured as abstract named properties. Upon navigating to Person.name, you can look for an abstract property of ~= and examine its argument's zipper to construct a query.
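The division of labor described above (build a coarse query from the part of the constraint the service understands, then post-filter with the whole constraint) might look like the following Python sketch. None of these names come from the Semantic Machines API: Clause, Constraint, find_property, satisfied_by, DirectoryService, and the NICKNAMES table are hypothetical, and the flat list of clauses is a much simpler structure than a constraint zipper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    field: str
    op: str      # "~=" (similar to) or "==" (exactly equal)
    value: str


NICKNAMES = {"misha": "mikhail"}  # domain knowledge the back end lacks


def canonical(name):
    lowered = name.casefold()
    return NICKNAMES.get(lowered, lowered)


class Constraint:
    def __init__(self, clauses):
        self.clauses = list(clauses)

    def find_property(self, field, op):
        """Return the argument of `op` on `field`, or None if it is absent."""
        for clause in self.clauses:
            if clause.field == field and clause.op == op:
                return clause.value
        return None

    def satisfied_by(self, record):
        def holds(clause):
            actual = record[clause.field]
            if clause.op == "~=":
                return canonical(actual) == canonical(clause.value)
            return actual == clause.value
        return all(holds(clause) for clause in self.clauses)


class DirectoryService:
    """Fake back end whose only query is a prefix search on names."""

    def __init__(self, people):
        self.people = people
        self.requests = 0

    def search(self, name_prefix):
        self.requests += 1
        prefix = name_prefix.casefold()
        return [p for p in self.people if p["name"].casefold().startswith(prefix)]


def person_controller(constraint, service):
    name = constraint.find_property("name", "~=")
    if name is None:
        return []  # this sketch can only query by name
    candidates = service.search(name_prefix=name[:1])  # deliberately broad
    return [p for p in candidates if constraint.satisfied_by(p)]  # post-filter


service = DirectoryService([
    {"name": "Mikhail", "team": "Design"},
    {"name": "Maria", "team": "Design"},
    {"name": "Mikhail", "team": "Sales"},
])
wanted = Constraint([Clause("name", "~=", "Misha"), Clause("team", "==", "Design")])
found = person_controller(wanted, service)
```

The controller issues a single broad request and relies on the constraint itself to discard Maria and the Mikhail on another team, which is the tradeoff between fidelity and request count that the domain developer has to manage.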

Early AI researchers envisioned a world where knowledge had a singular representation and a singular repository. Instead, we live in a world where data, and the ability to interpret it, is fragmented and diffuse. As a result, our constraint solver must be unusually extensible, allowing developers to compose it with their own systems and domain expertise.

Revision

A major challenge in interpreting a user's intent is everything that is left unsaid. Stripped of any context, much of what we say is ambiguous. To interpret "I'm headed to the bank," you need to know whether the speaker is near a river. In linguistics, the study of how context confers meaning is called pragmatics. Our dialogue system, then, needs to provide tools for developers to easily specify domain-specific pragmatics.

For example, if in Outlook Mobile a user says, "Reschedule my meeting with Alice to next week," you can reasonably assume they mean an upcoming meeting, because almost everything done in a calendar focuses on upcoming events. If you believed this was always true, you could simply take every user intension about an event and further constrain it to start in the future:

def add_pragmatics(
  predicate: Event => Boolean,
): Event => Boolean = {
  e => predicate(e) && e.start > now()
}

But what if the user wants to reschedule a past meeting that was canceled? Applying this function to "Reschedule yesterday's meeting with Alice to next week" will constrain the event to be both yesterday and in the future; the constraint will be unsatisfiable. The default assumptions, then, cannot be mixed into whatever the user provides; they must be selectively overridden, just as any other default value. Fortunately, there is a solution that is general across all domains:

def add_pragmatics(
  predicate: Event => Boolean,
): Event => Boolean = {
  revise(
    e => e.start > now(),
    predicate,
  )
}

In our system, revise is a powerful operator that, given two constraints a and b, will discard the parts of a that keep b from being meaningful, and conjoin the rest onto b. Consider a query for "yesterday's meeting," where we revise some basic pragmatics with the user's intension:

revise(
  e => e.start > now() && e.attendees.contains(me()),
  e => e.start.date == yesterday(),
)

The default assumptions are that the event being referenced starts in the future and will be attended by the user. The first clause of those defaults, however, contradicts the user's intension. The result of the revision, then, will consist of the second default clause and the user's intension:

e => e.start.date == yesterday() && e.attendees.contains(me())

Simply looking for contradictions, however, isn't enough. Consider a query for all the events since the year began:

revise(
  e => e.start > now() && e.attendees.contains(me()),
  e => e.start > beginning_of_year()
)

In this case, the user's intension isn't contradicted by the default assumptions, but it is implied by them. If an event starts in the future, it necessarily occurs after the year began. If we don't drop e.start > now(), we will effectively ignore what the user said.

Since both contradiction and implication are concerned with intrinsic properties of a data type (as opposed to extrinsic properties such as, "This corresponds to an entity in a back-end service"), our system can handle the revision process on its own. Developers can simply focus on defining the appropriate pragmatics for their domains.
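The two rules discussed above (drop a default clause when it contradicts the user's intension, and also when it implies it) can be sketched in Python for one very restricted kind of constraint. This is an assumption-laden toy, not the actual revise operator: every clause is an inclusive integer range on a named field, days are offsets from today (0 is today, -1 is yesterday, -100 stands in for the beginning of the year), "in the future" is approximated as day 1 or later, and attendance is encoded as a 0/1 field.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """One clause: lo <= event[field] <= hi, with inclusive bounds."""
    field: str
    lo: float
    hi: float


def contradicts(a, b):
    # Same field, and the two ranges share no value.
    return a.field == b.field and (a.hi < b.lo or b.hi < a.lo)


def implies(a, b):
    # Same field, and every value allowed by a is also allowed by b.
    return a.field == b.field and b.lo <= a.lo and a.hi <= b.hi


def revise(defaults, user):
    # Keep only defaults that neither contradict nor imply what the user said.
    kept = [
        d for d in defaults
        if not any(contradicts(d, u) or implies(d, u) for u in user)
    ]
    return list(user) + kept


def satisfies(event, clauses):
    return all(c.lo <= event[c.field] <= c.hi for c in clauses)


FUTURE = Range("start_day", 1, math.inf)
INCLUDES_ME = Range("includes_me", 1, 1)
DEFAULTS = [FUTURE, INCLUDES_ME]

yesterday = Range("start_day", -1, -1)
since_year_began = Range("start_day", -100, math.inf)

yesterdays_meeting = revise(DEFAULTS, [yesterday])
events_this_year = revise(DEFAULTS, [since_year_began])
```

In both cases the default about the future is dropped and the attendance default survives, matching the results described in the text. A default that is merely weaker than the user's clause is kept, since conjoining it changes nothing.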

The existence of a revision operator, combined with the fact that users speak intensionally, also means that users are able to tweak and build upon what they've already said.

Consider the utterance "Cancel my meeting with Alice." If the user and Alice work on the same team, it's likely they have more than one upcoming meeting together. We can guess at which one they mean, but before actually canceling the meeting, we will show them a description of the event and ask for confirmation.

Typically, confirmation involves giving the user a choice between "OK" and "cancel"; either we did exactly what they wanted, or they need to start over. Revision, however, means there is no need to start over. If the user follows up "Cancel my meeting with Alice" with "I meant the one-on-one," we'll revise the first intension with the second and look for a one-on-one with Alice.

This is enormously freeing for users because it means they don't need to fit everything they want into a single, monolithic utterance. This is akin to the difference between batch and interactive computing; users can try things, see what happens, and quickly build upon their successes.

This is also enormously freeing for developers because it means they can afford to get things wrong. We provide the best tools we can to help developers interpret the user's intent, but the cost of misinterpretation is small. In the worst case, the user will be forced to provide incrementally more information.

Final Thoughts

Wherever possible, business logic should be described by code rather than training data. This keeps our system's behavior principled, predictable, and easy to change. Our approach to conversational interfaces allows them to be built much like any other application, using familiar tools, conventions, and processes, while still taking advantage of cutting-edge machine-learning techniques.

Care must be taken, however, when revisiting ideas from the earlier era of AI research; used wholesale, these initial ideas are likely to send us down the same path as the people who first proposed them. Sometimes, as with plans, we have to make minor modifications. Sometimes, as with constraints, we have to acknowledge complexities that weren't even imagined by early researchers. Sometimes, as with revision, we have to create something entirely novel.

These ideas, properly dusted off and reconsidered, may be an important bridge between our industry and the rapidly expanding frontiers of computational linguistics.

Zachary Tellman works at Microsoft Research, where he is helping to design and build a framework for conversational interfaces. He is the author of Elements of Clojure and is working on a new book, tentatively titled On Software Design.

Originally published in Queue vol. 21, no. 1