Building a dialogue manager - A retrospective
I presented the paper on the dialogue manager I worked on as part of my undergraduate course at the SLAII. I had taken a break from the NLP scene and only recently started working on it again. In keeping with how fast machine learning is progressing, the year-and-a-half-long break has seen a lot of progress in this domain as well (I am still figuring out how to keep up with all of it). So, here I am, piecing together my thoughts. I'll talk about what I did and why, then discuss what is being done now and the direction I intend to move in.
A day in the dialogue manager's life
A dialogue manager is part of a spoken dialogue system (SDS). Generally, a spoken dialogue system has five components: automatic speech recognition (ASR), natural language understanding (NLU), the dialogue manager (DM), natural language generation (NLG) and text to speech (TTS). Let me describe what these do through an example dialogue exchange. Say we have an SDS that simply turns a set of lights on and off at a user's behest. Let's say the user starts by saying "hey budd, switch on the light will you". First the ASR processes the utterance and produces a textual representation of it. The NLU then produces a representation understood by the DM; it can be anything from raw text to POS tags, dialogue acts, word embeddings, etc. The DM takes the processed utterance and decides what it should do or say. Once it has decided how to respond to what the user said, it sends a signal to the NLG, which translates that into a textual representation, which in turn is produced as a voice signal by the TTS. That is the basic cycle the SDS goes through on every turn.

Now back to the utterance the user provided: the DM gets the utterance (represented in a way the DM understands) and works out that the user wants to switch on a light, but the user's input is vague as to which light needs to be switched on. So the DM needs to get that piece of information from the user. As such, the NLG gets a signal which it translates to something along the lines of "umm, which light are we talking about?". While the DM waits for the user to answer the query, it keeps tabs on what the user has said in the exchange so far, what has been done in the context of the dialogue exchange, what needs to be done and what the DM is expecting from the user.
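To make the cycle concrete, here is a minimal sketch of one turn in Python. Every component is a stand-in stub of my own invention (the function names and the dict-based representations are assumptions for illustration, not part of any real SDS toolkit):

```python
# A minimal sketch of the five-component SDS cycle. All components are
# hypothetical stubs; a real system would plug in actual models here.

def asr(audio):
    """Automatic speech recognition: audio -> text (stubbed)."""
    return "hey budd, switch on the light will you"

def nlu(text):
    """Natural language understanding: text -> a representation the DM accepts."""
    return {"intent": "switch_on_light", "slots": {}}

def dm(parsed, state):
    """Dialogue manager: decide what to do or say given the parsed utterance."""
    if parsed["intent"] == "switch_on_light" and "which_light" not in parsed["slots"]:
        return {"act": "request_info", "slot": "which_light"}
    return {"act": "perform", "intent": parsed["intent"], "slots": parsed["slots"]}

def nlg(decision):
    """Natural language generation: DM decision -> text."""
    if decision["act"] == "request_info":
        return "umm, which light are we talking about?"
    return "done!"

def tts(text):
    """Text to speech (stubbed): text -> audio."""
    return text.encode()

# One pass through the cycle for a single user turn.
state = {}
print(tts(nlg(dm(nlu(asr(b"...")), state))).decode())
```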
Let's consider two scenarios, the simpler case first: the user just says which light to switch on: "light number 3". Now the DM simply has to send a signal to "light number 3" to switch the light on; end of the dialogue exchange. The second scenario: the user doesn't know what lights there are or doesn't know how to refer to them, and hence asks: "what lights can you switch on?". The DM will now put the attempt to switch on a light on hold and infer which lights it can actually work with. Once it has inferred said information, it will relay it to the user. Let's say the DM decides to say "I can handle the lights one, three, five and six". As in scenario one, the user can say "light number three", bringing the exchange to the same conclusion as before, or say "the second one". The utterance "second one" on its own can mean anything; the second world war? the second muffin? Only in the context of the dialogue does it make any sense to the DM. Hence, if I were the DM, this is what I would have to work out:
- "Second" of what? Oh wait, I gave a list in the last turn, didn't I! So the user is talking about "light number three".
- What do I do with "light number three"? Ah, I believe I asked the user which light.
- Wait, why did I ask about a light? Right! The user asked me to turn on a light.
- Now I know what to do: "Hey light number three! turn yourself on will ya!"
How our minds manage to do this seamlessly is a whole other discussion. Now that the DM has inferred that the user meant to turn on "light number three", the DM can send the signal to "light number three". To make sure it got things right, it can ask the user for confirmation.
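To make that inference tangible, here is a toy sketch (my own, not from the paper) of the last step: resolving an ordinal reference like "second one" against the most recent list the system offered. The context representation and function name are hypothetical, and real resolution would be far less naive:

```python
# Resolve an elliptical ordinal reference against the dialogue context.
# The context is a simple list of turns; "options" marks a turn where
# the system listed things the user can refer back to.

ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3}

def resolve_ordinal(utterance, context):
    for word, index in ORDINALS.items():
        if word in utterance:
            # Walk backwards to the last set of options the system offered.
            for turn in reversed(context):
                options = turn.get("options")
                if options:
                    return options[index] if index < len(options) else None
    return None

context = [
    {"speaker": "user", "text": "hey budd, switch on the light will you"},
    {"speaker": "system", "text": "umm, which light are we talking about?"},
    {"speaker": "user", "text": "what lights can you switch on?"},
    {"speaker": "system",
     "text": "I can handle the lights one, three, five and six",
     "options": ["light 1", "light 3", "light 5", "light 6"]},
]

print(resolve_ordinal("second one", context))  # -> light 3
```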
Ideas behind the model
There are two core insights I draw on to build my model:
- What an utterance means to the system: an action/operation the system can perform.
- Every operation a system can perform can be described using a function.
Before I dive into what an utterance means to the system: what does an utterance mean to a human? What is meaning? Well, language on its own has no meaning; it's simply a collection of symbols and sounds. Meaning arises from what we relate these symbols and sounds to, which explains why various parts of our brain light up as we engage with language [^fn-brain-activity]. Take for example the question "what is blue?". If a child grows up being taught that what we refer to as red is blue, and one day meets someone from the outside world and hears them say that the sky is blue, that child is going to be very confused. Another way of thinking about this is: how do you explain what blue is to a blind person? In linguistics, among the many theories that try to explain meaning and language, speech act theory has been used widely in the context of dialogue management. Speech act theory states that every utterance made by a human is an action in and of itself. The concept of dialogue acts comes from speech act theory. Broadly speaking, each and every utterance is identified as a dialogue act, such as an information providing act or an information requesting act. If we look at the sample scenario discussed earlier, this becomes more apparent.
There have been a few attempts at standardizing dialogue acts, such as DIT++ and DAMSL. The problem with these existing standards is that they are primarily modeled from the perspective of the human: they try to define dialogue acts that describe what a human would mean. But what is needed is a set of dialogue acts that the machine can understand. Hence, most authors propose simpler dialogue acts that just encapsulate the operations a system can understand. One particular definition of dialogue acts I found interesting is from [^fn-berg]. The author defines three dialogue acts:
- Information providing act
- Information requesting act
- Action request act
The author identifies that the interaction between a user and a system only requires the two parties either exchanging information or asking the other to perform an action, such as switching on a light. If the previously described dialogue exchange is considered, it can be seen that all the utterances can be described by one of the three dialogue acts. The user asking the system to switch on the light would be an action request act. The system asking the user which light to switch on, or the user asking what lights are available, would be information requesting acts, and the user and system answering these queries would be information providing acts. Each type of act is then mapped to a different procedure in the system, as sketched below.
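A minimal sketch of that mapping, with handler names and bodies of my own invention standing in for the real procedures:

```python
# Berg's three dialogue acts, each dispatched to its own procedure.
# The handlers are illustrative stubs.

from enum import Enum, auto

class DialogueAct(Enum):
    INFO_PROVIDING = auto()   # "light number 3"
    INFO_REQUESTING = auto()  # "what lights can you switch on?"
    ACTION_REQUEST = auto()   # "hey budd, switch on the light will you"

def handle_info_providing(utterance):
    return f"noted: {utterance}"

def handle_info_requesting(utterance):
    return "I can handle the lights one, three, five and six"

def handle_action_request(utterance):
    return "umm, which light are we talking about?"

HANDLERS = {
    DialogueAct.INFO_PROVIDING: handle_info_providing,
    DialogueAct.INFO_REQUESTING: handle_info_requesting,
    DialogueAct.ACTION_REQUEST: handle_action_request,
}

print(HANDLERS[DialogueAct.ACTION_REQUEST]("switch on the light will you"))
```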
The second insight I draw from is that every action a system can perform can be described by functions or API calls. Consider a GUI interface, for example a web interface to book tickets for a movie. The user provides the information, and then the system makes an API call or calls a function where the information provided by the user is passed as parameters. The same API call can be made programmatically using a REST interface or similar. Another automated agent can also communicate with this system by issuing an API/function/procedure call. Even switching on a light can be described as a function call, which encapsulates the procedure of sending a signal to the light. Hence, it is not a long stretch to say that, in the context of a discussion about meaning, functions and the values they can take are the primitive entities a machine can understand. All the interactions a machine can have can be described as a sequence of function calls. This idea came about while I was learning (and being amazed by) lisp, which is a whole other story. In lisp, broadly speaking, everything can be defined as a function, from the basic control structures to more complex features like object oriented programming or exception handling. I generalize this perspective to all operations a system can perform. When a user interacts with the system, asking it for some information or to do something, using any form of interface, that can be described as the user trying to have the system execute a particular function.
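Under this view, the system's entire capability surface is just a registry of callables. A small sketch with hypothetical function names:

```python
# The system's operations as plain functions; any interface (GUI, REST,
# SDS, another agent) ultimately bottoms out in one of these calls.

def switch_on_light(which_light):
    """Encapsulates sending the on-signal to a light."""
    return f"light {which_light}: on"

def list_lights():
    """Queries which lights the system can actually work with."""
    return [1, 3, 5, 6]

OPERATIONS = {
    "switch-on-light": switch_on_light,
    "list-lights": list_lights,
}

print(OPERATIONS["switch-on-light"](3))  # -> light 3: on
```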
The model
The author of [^fn-berg], in the course of defining the three dialogue acts described before, identifies the base elements of a dialogue. The primary elements identified are concerns and replies, which are then extended to define the acts such that the acts can be connected to the procedures relating to them. A concern is simply one party expecting something from the other party in the dialogue exchange, whereas a reply is one party just providing information, feedback, or performing an action. Taking the perspective that all interactions with a system can be described with functions, a user's concern can be described as the user expecting the system to execute a function and give the results to the user, which would be a system's reply. Similarly, when trying to execute a function, the system may want some piece of information or action from the user, which would be a system's concern; the reply the user gives to the system would then be the user's reply. This further reduces the dialogue acts in a task oriented dialogue exchange to two: concerns and replies.
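As a data-structure sketch (the field names are my own choices, not from [^fn-berg]), concerns and replies could look like this:

```python
# Concerns and replies as plain data structures. A system concern
# encapsulates the function it is trying to execute.

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Concern:
    """One party expecting something from the other."""
    owner: str                           # "user" or "system"
    function: Optional[Callable] = None  # system concerns encapsulate a function
    children: list = field(default_factory=list)
    resolved: bool = False

@dataclass
class Reply:
    """One party providing information, feedback, or an action's result."""
    owner: str
    content: object

# The root-concern is a system concern with no function attached.
root = Concern(owner="system")
```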
Any task oriented dialogue exchange with the system starts with the system expecting the user's demands, which we will call the root-concern. The root-concern is the "how can I help you" in an SDS, the blinking cursor in a CLI, or the mouse pointer in a GUI. Anything the user asks the system to do is a reply to this concern. The user saying "hey budd, switch on the light will you" would be a response to this concern. By saying this, the user is introducing their own concern into the dialogue exchange. To the back-end system, this means the user wants to execute the function responsible for turning on a light. Let's assume this function is switch-on-light; ideally this function would take a single parameter, i.e., which-light. Hence, the dialogue manager now has to try and execute this function, which is a concern of the DM in and of itself. In order for this concern to be resolved, which in turn would resolve the concern the user introduced, the system must execute the function switch-on-light, which cannot be done since it needs a parameter. In order to discern the parameter from the context of the current dialogue exchange, the DM introduces another concern: to determine the parameter from the context of the dialogue. From an execution standpoint, that means executing a parameter extraction function. How a parameter extraction function works is a topic for another day. For now, this concern is to execute the parameter extraction function, which we will call switch-on-light-which-light (since the process of extracting said parameter can be encapsulated using a function). Now when the DM executes this function to resolve the concern last introduced, it will (or at least should) fail, as the answer is not present in the context. Looking at the context of the dialogue exchange so far, it can be seen that in order for any of the concerns to be resolved, the concern that represents (the need to execute) switch-on-light-which-light will need to be resolved. This can only be achieved by the system having the user provide this information, which results in the user being asked for it.
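Tracing that exchange as a tree, with a pared-down version of the Concern structure above (the node names are illustrative):

```python
# The concern tree after the user's opening utterance. Resolution is
# blocked at the deepest node, so the user gets asked "which light?".

from dataclasses import dataclass, field

@dataclass
class Concern:
    name: str
    children: list = field(default_factory=list)
    resolved: bool = False

root = Concern("root-concern")                    # "how can I help you"
ask = Concern("user: switch on the light")        # the user's concern
run = Concern("system: execute switch-on-light")  # the DM's own concern
extract = Concern("system: execute switch-on-light-which-light")

root.children.append(ask)
ask.children.append(run)
run.children.append(extract)

# Walk down to the concern that blocks everything above it.
node = root
while node.children:
    node = node.children[0]
print(node.name)  # -> the concern the user must be asked about
```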
When the user provides this information directly, that is, says "light number 3", that will (or at least should (*poker face*)) allow the latest concern to be successfully resolved, which results in the previous concerns being resolved in a cascade. In the case where the user asks for more information on the lights, that would be the user introducing another concern, which from the system's perspective is to execute a function that queries a knowledge base and returns the appropriate information. Same as before, the system introduces a concern expressing its desire to execute said function. Let's assume this function does not take any parameters, which in turn allows the system to successfully execute the function, effectively resolving the immediate concern and the user's concern to execute said function. But it won't resolve the concern pertaining to the function switch-on-light-which-light, as that parameter expects one value and the context now contains several candidate values. Hence, the user needs to provide the answer, either on a prompt from the system or immediately after the system lists the values. The answer the user provides would be more than sufficient for the switch-on-light-which-light function to discern the value, which will resolve it, resulting in the previous concerns being resolved.
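The cascade itself is naturally expressed as a depth-first walk: a concern resolves only when all of its child concerns have resolved and its own function (if any) succeeds. A minimal sketch, again with made-up names:

```python
# Cascading resolution over the concern tree.

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Concern:
    name: str
    function: Optional[Callable[[], bool]] = None
    children: list = field(default_factory=list)
    resolved: bool = False

def try_resolve(concern):
    """Resolve children first, then this concern's own function."""
    for child in concern.children:
        if not child.resolved and not try_resolve(child):
            return False
    if concern.function is not None and not concern.function():
        return False
    concern.resolved = True
    return True

# Once the user says "light number 3", parameter extraction succeeds and
# the success cascades up to the root-concern, concluding the exchange.
context = {"which-light": 3}
extract = Concern("switch-on-light-which-light",
                  function=lambda: "which-light" in context)
run = Concern("switch-on-light", function=lambda: True, children=[extract])
root = Concern("root-concern", children=[run])

print(try_resolve(root))  # -> True
```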
Now that the relationship between the concerns and replies of both system and user is clear, how does this translate into a model? There are a few significant relationships between these entities that can be observed:
1. All concerns and replies form a rooted tree, with the root-concern as the root.
2. Every system concern encapsulates a function (except the root-concern, which is considered a special case).
3. Every user utterance (concern or reply) is a response to a system concern.
4. Every user concern tries to trigger a function.
5. Every system concern (except the root-concern) is a child of (or introduced as a result of) a user concern.
Following 2 and 3, we can say "each user utterance is a reply to a function". Similarly, following 4 and 5: "each user concern tries to trigger a function". As an extension of that: "the difference between a user concern and a user reply is whether or not it triggers a function". These (rather ludicrous, but insightful) statements form the backbone of the model. This boils the core process of the dialogue manager down to two tasks, i.e. for each user utterance the DM needs to decide the following:
- Which function is the utterance a reply to?
- Which function does the utterance try to trigger? (Where ‘none’ is a possible answer)
These two tasks can be defined using two classifiers, where the prediction of each classifier is one or more functions. The second task can be further divided into two questions (tasks): "Does this utterance try to trigger a function? If it is trying to do so, which function is it trying to trigger?". This in turn can be translated into two classifiers for the second task. The output of the classifiers is used to decide where this utterance fits in the tree, which we will call the context. After each utterance, the system goes through all the nodes in the context, introduces any concerns it needs to address, and tries to resolve them if possible. For a concern to be considered resolved, it should not have any unresolved child concerns and it should have successfully executed the function it is related to (if it encapsulates one). This also leads to how a dialogue exchange is considered to be concluded: when the root-concern can be marked as resolved. Once the system has processed the utterance, the state of the tree, or the change in the state of the tree as a result of the user's utterance, is sufficient to decide what the system has to do next: talk about the newly introduced system replies and then talk about the next most important concern the system has. That, in conclusion, is the model.
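Putting the pieces together, the core loop might be sketched like this, with trivial rule stubs standing in for the trained classifiers (everything here is illustrative):

```python
# The DM's two classification tasks, as rule stubs. In the real model
# these would be learned classifiers over the utterance and the context.

def classify_reply_target(utterance, open_functions):
    """Task 1: which function is this utterance a reply to?"""
    # Stub: assume it answers the most recently opened concern.
    return open_functions[-1]

def classify_trigger(utterance):
    """Task 2, split in two: does it trigger a function, and which one?"""
    if "what lights" in utterance:
        return "list-lights"
    if "switch on" in utterance:
        return "switch-on-light"
    return None  # a plain reply; triggers nothing

utterance = "what lights can you switch on?"
reply_to = classify_reply_target(utterance, ["switch-on-light-which-light"])
triggered = classify_trigger(utterance)

# From these two outputs the DM decides where the utterance fits in the
# context tree, then walks the tree introducing and resolving concerns.
print(reply_to, triggered)  # -> switch-on-light-which-light list-lights
```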
What's next?
As with pretty much all facets of AI, dialogue managers and spoken dialogue systems have also been implemented with parts or all of the pipeline using deep neural networks. There is a range of approaches, from memory networks to generative adversarial approaches, and these have seen great success. A major limitation I see with them is that they lack transparency, which in turn means that adapting a trained DL model to a custom task is questionable; on the other hand, training a model for the task can be resource intensive and requires a substantial amount of data. I am not saying that DL is bad; personally I believe DNNs are a powerful tool. But from experience I can say that there is a huge uncertainty factor when it comes to deep learning, which stems from the lack of understanding of how it works. In addition, not being able to properly control how a dialogue progresses and is processed can be troublesome when trying to integrate it with an application. Another small beef I have with most of the end-to-end approaches is that they attempt to discern context and meaning through analyzing patterns in the words; in other words, they try to extract meaning from an entity that has no meaning. As a solution to this, I am working on combining my model with the deep learning approaches. We already know DNNs can classify exceptionally well. Also, the surge in adversarial techniques has shown their potential to synthesize sentences. Hence, a line of inquiry I am working on is to extend my model with these approaches. While it will not be end-to-end differentiable, it gives the advantage of being less opaque. A problem that will require attention in the future is the fact that the approach still requires a lot of data.
[^fn-berg]: Berg, Markus M., and Bernhard Thalheim. "Modelling of Natural Dialogues in the Context of Speech-Based Information and Control Systems." Dissertation, University of Kiel.

[^fn-brain-activity]: "A 3D Map Of The Brain Shows How We Understand Language", https://www.popsci.com/3d-map-brain-shows-how-we-find-meaning-through-language, accessed 10th January 2018.