Abstract
This paper presents a multimodal system composed of fusion and fission engines that take into account the contextual information of the user. The fusion engine combines various input modalities to determine the overall context of a given situation and yields the situation that needs action. This action is subdivided into smaller tasks that are sent to actuators, gadgets and other output modalities for implementation; this process is handled by the fission engine. Our goal is to build a system that helps people with disabilities interact with their ambient intelligent environment using their natural communication skills, such as speech and gesture. Both the fusion and fission engines rely on an ontology, which is the knowledge base of our system, while taking into consideration contextual semantic information. The system design is validated through case simulations based on formal specifications written as colored Petri nets. This work is our contribution to the ongoing research on robotic applications that render services to the handicapped and the elderly.
Introduction
Humans interact with their environment using natural communication skills such as speech or gesture. Researchers are trying to imitate man’s brain functions to create machines that are able to communicate with individuals using these natural communication features. Multimodal interaction applications have been demonstrated to offer better flexibility and reliability than other human–machine interaction systems [16]. Hence, these applications are preferred over unimodal interaction applications by users [15]. Indeed, multimodal systems represent a new class of interaction systems capable of interpreting information from various sensory and communication channels. They use a richer and more natural way of communication, such as speech or gesture, and more generally all the five senses. These systems can also integrate computational skills in the real world by offering more natural ways of interaction to humans.
Generally, in the mobile robotic domain, fusion is done at the signal level and is used to estimate the positions and angles of the mobile robot so that it can be localized more accurately. Many researchers have worked on sensor fusion because the effective fusion of sensor data is essential for increasing the robot system’s capability to accomplish complex tasks.
In this paper, we present a new solution for human–machine interaction by building a smart system that combines fusion and fission engines. In fact, combining these two fundamental components of a multimodal interaction system assures that the combined input modalities are understood and the action corresponding to the given situation is implemented. Our approach uses the semantic representation of the environment for the fusion and the fission process. This ensures a common representation of the data exchanged between the user and the machine which means the data has the same format and the same meaning for both the machine and humans.
Furthermore, compared to existing works, such as the Hierarchical Architecture for Behaviour-Based Robots “BBR” [14], the behaviours in our system are not defined at the beginning of an action but are derived from what is received from the user when requesting a service. In our case, this applies to the selection of modalities, which is based on the semantic context, whereas behaviour-based systems rely on predefined behaviours and predefined conditions regardless of the dynamic context. Furthermore, by using an ontology as the knowledge base of our system, we are able to define the environment in detail and ensure the reusability and openness of the system. Our approach also takes time into consideration, which is an important aspect of dynamic robotics.
Our system is easily adaptable to various applications. For instance, while our system is applied to a wheelchair with a manipulatable arm in this paper, by simply adjusting the elements of the ontology, we can readapt it to any other human–machine interaction system. This ensures adaptability of the system based on the user’s circumstances in a multitude of domains and aspects. Finally, we have chosen to validate our approach by using Colored Petri Nets with the tool CPN Tools (Aarhus University 2012). This open source tool allows the modeling and validation of distributed systems.
The work presented in this paper is structured as follows: Section 2 presents the problem statement and highlights the novelty of our approach. Section 3 is a summary of existing works and Section 4 presents our system architecture. Section 5 will focus on ontology that is the knowledge base of our system. In Section 6 we will present the modality selection part. Section 7 focuses on the fusion engine and Section 8 on the fission engine. Section 9 presents a command example while Section 10 will focus on the validation of our approach using the Colored Petri Nets. In Section 11, we present a real world implementation example and the Conclusion is presented in Section 12.
Problem statement
This work is based on the concept of multimodal human–robot interaction applied to a wheelchair with a manipulatable arm. The main challenge of our work is to build fusion and fission engines that combine multimodal inputs while taking into consideration the context of the user. We list below the sub-challenges that we encountered and present the proposed solution for each of them.
The first challenge is to choose how to represent the modalities and the context and how the system will understand them. Our proposed solution is building a knowledge base that includes all the elements of the desired environment. To this end, we use the concept of ontology, populate it with all the elements of the system’s environment, and then define the relationship among them. The use of ontology ensures: (i) “openness” of the system – it takes into consideration all input and output modalities relevant to the system; (ii) “regularity” – in describing the composition of the environment, we describe the most adequate elements and scenarios, and (iii) “flexibility” – the system allows the addition and/or the removal of entities according to user’s profile or domain of application.
The second challenge is determining which modality is available and how to consider it as an input or an output modality given that the environment is dynamic. Our solution to this is to build a system that takes into account the semantic contextual information. This can be done by continuously retrieving information from the environment.
The third challenge is to comprehend a user’s request and merge all input information related to it. Our proposed solution is to build a rule-based fusion engine that merges the input data according to predefined models in the ontology.
The fourth challenge is how to send elementary subtasks to output modalities and answer the user’s request. Here, we build a rule-based fission engine that subdivides the fusion results using predefined patterns in the ontology and sends the results to the available output modalities.
The fifth and final challenge is the validation of the proposed architecture. Here, we opted to use Colored Petri Nets and CPN Tools to model our system and visualize the simulations.
Related work
In recent years, multimodal fusion has been gaining the attention of researchers in various domains due to the benefits of using multimodal inputs and outputs. In this section, we present a brief overview of the state of the art in this domain. We end the section by comparing various approaches with ours and presenting the novelty of our work.
Multimodal systems
In [17], Oviatt and Cohen presented the benefits of combining multiple modalities on the input and output sides of a multimodal system: the system becomes more robust and the probability of errors is reduced. In addition, a study on multimodal interaction within an ambient environment is presented in [20]. This exploratory study examines the relationship between input and output modalities and how the output modalities can influence the choice of the input modalities used by a user.
Multimodal fusion
Since the first multimodal system, the famous Bolt’s system “Put that there” [3], several multimodal systems have been proposed. For instance, Prodanov and Drygajlo [19] built a Bayesian network framework to interpret multimodal signals. The system was used for a dialog between the tour guide RoboX and the visitors of a museum under noisy conditions. The use of Bayesian network in this work allowed the combination of noisy speech recognition with data from a laser scanner used to detect the presence of people near the robot.
Human–computer interaction uses multimodal fusion to interpret the combined input modalities. An example is the concept-based evidential reasoning approach for multimodal fusion in human–computer interaction presented by Reddy and Basir [24]. This work proposes an approach for the semantic fusion of different input modalities based on transferable belief models. The architecture is applied to a multimodal system composed of a gesture recognition sensor and a brain–computer interface.
Multimodal fission
The fission process consists of subdividing the resulting action as an output to the situation produced by the fusion engine into elementary sub-tasks. In the work on adapting multimodal fission to the user’s abilities [7], a gentle user interface for elderly people (GUIDE) is proposed to enhance the interaction between humans and computers. Their fission engine follows three tasks: message construction, modality selection and output coordination. The work presented in [7] is based on the What–Which–How–Then (WWHT) conceptual model [22] for multimodal interaction. This model describes the life cycle of a multimodal presentation through its evolution within a perpetual change of an interaction context.
The authors in [1] presented their work on a multimodal fusion, fission and virtual reality simulation for an ambient robotic intelligence. In this work, the authors designed agents that communicate with the semantic web using the EKRL language.
Ontology-based works
The concept of ontology is also used for multimodal fusion and fission. For instance, in [18], the authors present the detection of violence in movies by using ontology for the multimodal fusion. Two different fusion approaches are used: the first one is a multimodal fusion that provides binary decisions on the existence of violence, and the second is an ontological and reasoning fusion that combines the audio-visual cues with violence and multimedia ontologies [18].
Furthermore, in [27], the authors used the concept of ontology for modeling of fission rules. The authors present in their work an example of the scenario “bring me a cup” and how the robot has to answer this request using a colored Petri net.
Comparison
Our aim is to propose an innovative work for the control of a wheelchair with a manipulatable arm by building a human–machine interaction system that combines the fusion and the fission processes. Our system will be able to understand the request, merge input modalities and answer the user demands by sending unimodal tasks to output modalities. We have chosen to build a multimodal ontology-based fusion and fission engine using the WWHT [22] model and the ontology concept. We propose a fusion and fission engine able to understand different types of inputs and merge them according to a dynamic context. Also, the system takes into consideration its semantic context unlike other works like the SFX architecture of R.R. Murphy [13] which uses a context-free grammar for moving a robot.
In 1986, Rodney Brooks proposed the subsumption architecture [4], where higher levels subsume the actions of lower levels. This is similar to our system’s intelligence, which focuses on the adaptive architectural design of multi-agent systems. However, Brooks’ architecture does not store environment models internally; it uses the environment itself instead. There is no function selection mechanism, no ontology and no semantically based memory. The functions use all the inputs present in the system regardless of which input activates them. Such a system is not intelligent in our sense because it does not understand the environment but merely adapts to it. The environment can serve as a good memory for robots, mechanical parts or moving agents while they interact with it, but not for standalone reasoning agents without memory. The environment can hide many models, and accuracy is available only if they can be perceived.
Another issue arises when there is a big change in the environment. This requires a new robotic architecture or a more adaptive one, and a more intelligent agent to handle it. In the case of Brooks, the environment must be fully controlled. In terms of architectural design, our agents are robust, with parallel and distributed simple processors. Moreover, our agents have models and a good memory of past events, which make them more powerful, adaptive and cognitive compared to Brooks’. Unlike in Brooks’ architecture, we are also able to monitor and analyze what is truly happening in our agents’ minds. Going back to the eighties, Brooks also said: “It doesn’t scale”.
Our semantic agents are rational and knowledge based (ontology and episodic memory), and operate in an observable dynamic environment [26]; they are neither reactive nor self-reflexive, and neither goal based nor utility based, because we do not consider such agents intelligent. The agent communication language relies on a knowledge representation language [8]. BDI (belief desire intention) agents [21] are similar to ours with respect to that aspect of the BDI software model. In terms of research relevance, the strength of BDI is the existence of logical models through which it is possible to define and reason about agents. However, these models lead to temporal persistence in plans, and further plans are made on the basis of those to which the agent is already committed. The BDI interpreter also has limitations (basic logic, missing attitudes, unsuitability for behavioral learning, no explicit representation of goals, no deliberation or forward planning).
In this work, our agents control a wheelchair and other robots in order to assist disabled users at home so that they can continue doing essential tasks in their daily lives [10]. In most cases in the literature, it is difficult to find temporal and spatial validation [9] and real-time re-planning. In the work of Mataric [14], planning or re-planning is based only on static and dynamic obstacles or lock issues in a behavioral path; the agent refines the plan in local reaction to changes in the environment. Conditions are not based on ambient intelligence, nor on memory or the semantic context and processing of the agents. That work is interesting in the sense that it is adaptive in real time; however, even if it applies a good robotics planning approach, its improved algorithm is not based on the intelligence of an agent (multimodal reasoning abilities) and it avoids learning behaviors from situations observed over time.
Among the available methods for the temporal and spatial validation of knowledge-based agents, formal methods (although computationally very costly) are arguably the most powerful. One of the most promising approaches in this field is based on Petri nets [2]. CPN Tools is a software application that models Petri nets; it is recognized as a good stochastic software engineering methodology for randomly checking and evaluating the reactions of agents in all types of good and bad situations. Petri nets are appropriate tools to simulate and validate a software design such as agent communication [6], and to verify and validate semantic software parts like web services or web-based communication agents [11]. They can also be used to model autonomous behaviors for heterogeneous multi-agent systems in order to verify that they can operate within predefined mission requirements and constraints [5]. We validate our approach using CPN Tools, with Colored Petri Nets modeling the global architecture. To our knowledge, no formal model based on the Colored Petri Nets formalism exists that helps the validation of the fusion and fission processes of multimodal interaction systems.
Proposed architecture
The wheelchair with a manipulatable arm is used by people with disabilities to interact with the environment and to provide them with services. The user can use multimodal inputs, such as speech and gesture, to request services. Our proposed architecture, presented in Fig. 1, detects inputs from the environment using sensors and merges them using the fusion engine to understand their meaning. It then subdivides the needed action (based on the situation deduced by the fusion engine) using the fission engine and sends unimodal tasks to output actuators or gadgets, such as the moving arm or a computer screen, to execute the action corresponding to the user’s request. Our proposed fusion and fission engines are based on a precondition mechanism. The preconditions are satisfied by checking the semantics, the vocabulary, the model and the temporal aspect, continuously referring back to the ontology, which is the knowledge base of the system. The ontology is the basis of all declarations and information retrieved from the environment.

Fig. 1. Multimodal system architecture.
Figure 1 shows a general view of our architecture (the wheelchair represented in the diagram is the Arlyn ChairBot [25]). The architecture is composed of four parts: the input, the multimodal system architecture (MSA), the knowledge base (ontology) and the output. The input part is composed of input modalities coming from the environment.
The term environment represents the physical location and all the entities present in the user’s surroundings. These events will be sent to the MSA. The MSA is where the fusion and fission take place. The MSA is composed of four parts: Input and Output Modality Selection, Fusion and Fission Engines. Both modality selection parts decide which modality is accepted according to the context state. For example, if the environment is too noisy, the system will not select a vocal command nor send a vocal acknowledgment to the user. Hence, the modality selection component deactivates all the modalities associated with vocal modalities, such as the voice sensor, the speaker, and the speech recognition entity. By referring to the knowledge base, the MSA will merge the information obtained from the environment using the fusion engine, get the appropriate resulting action and subdivide the action into subtasks using the fission engine.
The knowledge base is the ontology that describes in detail all the elements of the environment and the relationships among them. The fusion models and the fission patterns are likewise stored in the ontology. Finally, the output part is composed of output modalities used by the system to inform the user about the progress of the process or to answer a user’s request.

Fig. 2. The ontology diagram: (a) the class Environment and its relationship with other entities; (b) the Context class and its subclasses.
The ontology allows a large number of elements that describe the user’s environment to be taken into consideration. It also allows the integration of rules that must be taken into account during the modality selection as well as in the fusion and the fission processes. These rules are stored in the ontology using the semantic web rule languages SWRL (for rules) and SQWRL (for queries) based on W3C standards. We use the open source tool Protégé (www.protege.stanford.edu) to implement the ontology.
In our work, the environment of the robotic system (the wheelchair with a manipulatable arm) is also the user’s environment. As such, this environment comprises the interior of the user’s home and all the elements that it contains, as well as the outdoors, which can be a garden or the neighborhood. The environment ontology, represented by a hierarchical graph, allows reasoning based on semantic information. When building an ontology, we have to describe the environment in a hierarchical way: the composition of the environment is defined through the essential classes and their subclasses, and each class is then populated with individuals according to their affiliation. For example, the class Food has Fruit and Medication as subclasses, with apple and pills as individuals of these classes, respectively.
Furthermore, to define the relationships among entities, we use object properties (for linking two objects) and data properties (to connect an object with a value) to give meaning to our ontology. In addition to the properties mentioned above, we use semantic relations that allow the robot to understand some actions that are common knowledge to people. For example, we can define a relationship between a door and its state. This relationship allows the robot to understand that, when moving around, it can pass through a door that is open, whereas it has to open a closed door before passing through it.
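As an illustration only, the class, subclass and individual structure described above can be sketched in Python. The actual system stores this hierarchy in the ontology; the dictionaries and the function name `individuals_of` below are hypothetical stand-ins, and only Food, Fruit, Medication, apple and pills come from the text.

```python
# Hypothetical, simplified stand-in for the hierarchy kept in the ontology.
# Class -> direct subclasses (names taken from the example in the text).
SUBCLASSES = {"Food": ["Fruit", "Medication"]}

# Class -> individuals that populate it directly.
INDIVIDUALS = {"Fruit": ["apple"], "Medication": ["pills"]}


def individuals_of(cls):
    """Return the individuals of a class, including those of its subclasses."""
    found = list(INDIVIDUALS.get(cls, []))
    for sub in SUBCLASSES.get(cls, []):
        found.extend(individuals_of(sub))
    return found


if __name__ == "__main__":
    print(individuals_of("Food"))
```

Querying the superclass returns the individuals of all its subclasses, which is the kind of hierarchical reasoning the ontology is meant to support.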

Fig. 3. Object property “Is Affected By” and data type property “has Light”.
Figure 2 shows our ontology. The class Environment, in Fig. 2(a), is the super class that contains all the entities present around the user. These entities are described in different classes according to their types:
Input Modality – contains all the modalities used in our system; we have chosen to use speech, gesture, eye gaze, manual modality (the keyboard and the mouse of a computer) and touch.
Output Modality – it is composed of moving output, visual output and vocal output.
Alarm – it contains the alarms used by our system, and can be system alarm or health alarm.
Surroundings – is composed of all the elements present in the surroundings of the user. These may be people, objects and places surrounding the user.
Vocabulary – this contains the words used when a user requests for a service.
Time – is for the maximum time allowed for a command to be accepted, and the maximum time between two successive modalities.
System context – takes into account the information obtained from sensors embedded in the wheelchair. We have ultrasonic and infrared sensors which are placed on the four edges of the wheelchair and detect obstacles when moving.
Context – presented in detail in Fig. 2(b), is composed of the environment context, the health context, the user’s context and the system context. For instance, the environment context takes into account the elements which can affect modalities, such as the lighting level, the noise level, the weather conditions, and the ambient temperature.
The fusion model and the fission pattern subclasses – contain our predefined models of different commands and solutions used in the fusion and the fission engines, respectively.
After defining the classes, we get into the details of each class by adding the instances (individuals) that might be part of it. By instances, we mean all people, objects, vocabulary, etc. This definition gives us an accurate overview of the composition of the user’s environment. Indeed, the advantage of the ontology is that it ensures the openness of the system. This means that the definition of the instances is not exhaustive and that elements can be added or deleted at any time according to application needs. Since the environment of a person is composed of a multitude of elements, we present only a small number of the instances defined in our ontology as examples in this paper. For instance, the class “object used for liquids” is composed of cup, mug, jug, glass and bottle. Also, the class “object pronoun” is composed of this, that, those, it and these.
We then define the relations among different entities in the environment using properties. We use object properties to link two objects, and data type properties to link an object to an XML Schema datatype or an RDF literal. In this paper, due to space constraints, we present two representative examples rather than all the properties used. Consider, for instance, Fig. 3. It shows two of the properties used in our ontology.

Fig. 4. Result of the query for noise related modality deactivation.

Fig. 5. Result for output modality selection.
The object property “Is Affected By” links the individual “Lighting level sensor” to the individuals that depend on the light level (i.e. luminosity) to work properly. These refer to the screen for output modalities, and gesture, keyboard, touch screen, mouse and eye gaze sensor for input modalities. Hence, if the light level is low, the modalities will be deactivated because no data can be detected correctly. We also find the datatype property “has Light” which links the individual “Lighting level sensor” to its value, say 30 lux. In addition to the properties mentioned above, we use semantic relations to allow the robot to understand some actions that are common knowledge to people. For example, we can define a relationship between liquid drinks and objects used for liquids. This relationship will allow the robot to understand that a liquid needs a container, and when the user requests a drink, the need for using an object that is able to hold liquids (for example, a glass or a bottle) is implied.
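The effect of these two properties can be sketched in Python as follows. This is a minimal sketch, not the ontology itself: the dictionaries mirror the “Is Affected By” and “has Light” properties from the text, while the threshold argument `min_lux` is a hypothetical parameter, since no minimum lighting value is stated here.

```python
SENSOR = "Lighting level sensor"

# "Is Affected By": modalities that depend on the lighting level (from the text).
IS_AFFECTED_BY = {
    SENSOR: ["screen", "gesture", "keyboard", "touch screen", "mouse",
             "eye gaze sensor"],
}

# "has Light": current reading in lux (30 lux is the example value in the text).
HAS_LIGHT = {SENSOR: 30}


def deactivated_by_light(min_lux):
    """Return the modalities to deactivate when the reading is below min_lux.

    min_lux is an assumed threshold; the text does not give one.
    """
    if HAS_LIGHT[SENSOR] < min_lux:
        return list(IS_AFFECTED_BY[SENSOR])
    return []


if __name__ == "__main__":
    print(deactivated_by_light(min_lux=50))  # reading too low: all six listed
    print(deactivated_by_light(min_lux=10))  # reading sufficient: empty list
```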
A multimodal system implies the use of multiple modalities for inputs and outputs. Indeed, multimodal systems need to understand these modalities and select the available ones according to the context state. When choosing a modality, the system has to retrieve the contextual information continuously and compare the result to values defined in the knowledge base.
In addition to building and representing the ontology, the Protégé tool allows integration of rules and queries using SWRL and SQWRL languages. Hence, to check the availability of modalities according to the context, we have defined four queries for input modalities and four queries for output modalities selection. For instance, the input modality “Voice” is linked to the noise level. If the system detects a noise level higher than the maximum allowed level defined in the ontology, that is 60 dB, the Voice Sensor will be deactivated. We then assume, in this example, that the environment around the system is noisy and that the detected noise is 75 dB. Assume further that all other context information values are in the accepted ranges. Based on this data, the query for the selection of modalities is given below:
The result of this query is presented in Fig. 4. The diagram shows that the voice sensor will be deactivated because of the noise level detected, being 75 dB. This implies that the wheelchair user will not be able to make a vocal command due to a noisy environment. The modality check will also be done for all the modalities linked to handicap type, weather conditions and the luminosity level.
For output modalities, a check is also necessary so that the system knows which modalities can be used when sending the final result to actuators. We have defined four selection queries, according to handicap type, light level, noise level and weather conditions. Knowing that the environment is noisy, the result of the available output modalities is presented in Fig. 5. As such, the output modality which will be deactivated is the speaker. This implies that the available modalities are the moving arm, the wheels and the screen.
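The noise-related selection logic described above can be approximated by the following Python sketch. It is not the SQWRL query used in the system; it only reproduces the stated behaviour (maximum level of 60 dB, detected level of 75 dB). The grouping of modalities into noise-sensitive sets and all function names are assumptions made for illustration.

```python
MAX_NOISE_DB = 60  # maximum allowed noise level stated in the text

# Assumed grouping: which modalities are tied to the noise level.
NOISE_SENSITIVE_INPUTS = {"voice sensor"}
NOISE_SENSITIVE_OUTPUTS = {"speaker"}
OUTPUT_MODALITIES = ["moving arm", "wheels", "screen", "speaker"]


def deactivated_inputs(noise_db):
    """Input modalities switched off at the given noise level."""
    if noise_db > MAX_NOISE_DB:
        return sorted(NOISE_SENSITIVE_INPUTS)
    return []


def available_outputs(noise_db):
    """Output modalities that remain usable at the given noise level."""
    if noise_db > MAX_NOISE_DB:
        return [m for m in OUTPUT_MODALITIES
                if m not in NOISE_SENSITIVE_OUTPUTS]
    return list(OUTPUT_MODALITIES)


if __name__ == "__main__":
    print(deactivated_inputs(75))  # ['voice sensor']
    print(available_outputs(75))   # ['moving arm', 'wheels', 'screen']
```

The same pattern would apply to the queries tied to handicap type, light level and weather conditions, each with its own threshold or condition.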
Table 1. Fusion models defined in the Ontology

Fig. 6. Fusion Model 4.
The fusion engine is an essential component of our system: it allows the merging of information obtained from the environment, such as the modalities and contextual information. When the user makes a request, the system detects the events and merges them to offer a service requested by the user. To do so our fusion engine uses the WWHT model [22] for the multimodal interaction. Indeed, the WWHT model is composed of four questions:
What is the information to render?
Which modalities should we utilize to present this information?
How to roll out the information applying these modalities?, and
Then, how to manage the progress of the resulting presentation? [23].
In our case, the answers of the questions are:
“What” – refers to what the system has detected, for example which of the available input modalities was used: is it a voice modality or a gesture?
“Which” – the system will decide which model the fusion engine should use for merging input modalities.
“How” – the system will merge the information from input modalities using the fusion model selected from the pre-defined models stored in the ontology.
“Then” – selects the appropriate fission solution and sends the fusion result alongside the corresponding fission solution.
Before proceeding with fusion, the system needs to go through four checking stages. First, the system must perform a semantic check. Using the ontology, we check consistency by checking the relations among its components. This can be done by using the inference engine Pellet and the Jess plug-in of Protégé. The Pellet engine checks the inferences, the taxonomy and the consistency among classes in the ontology, and the Jess engine checks the SQWRL queries defined in the ontology.
Secondly, the system also checks the presence of the detected events in the predefined vocabulary in the ontology. This checking allows the system to reject any undefined event that will not be used by the fusion engine, such as, for example, the sound of a door or a television. Thirdly, the order of the events will be checked. To do so, the events obtained from different modalities will be checked using SQWRL queries.
We have defined nineteen models, presented in Table 1, that describe examples of requests made by the user. Each model has its own query defined in the SWRL tab of Protégé. For instance, the command “Take me there” will be classified as being an example of a command that satisfies Model 4 of Table 1. The models have been defined as classes in the ontology and each class is composed of subclasses linked by object properties. The subclasses of the models are classes already defined in the ontology. For example, the subclasses of Model 4 are words for tracking, personal pronoun and location which are also the subclasses of vocabulary and surroundings respectively. Furthermore, we have defined 19 object properties that link the subclasses of each model to allow the definition of the order of events.
Table 2. Fission solutions defined in the Ontology
When detecting an event, each part of the event has to be an instance of the subclasses of the model. Given the example cited above, the system has to first detect the word “take” as an instance of words for tracking, then the word “me” as an instance of personal pronoun and finally the word “there” as an instance of location. The order of the events has to be satisfied and match the order of Model 4. Otherwise the event will be rejected as it does not match any predefined model of the ontology. An example of the class Model 4 is presented in Fig. 6.
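The vocabulary and order checks for this example can be sketched in Python as follows. In the system these checks are SQWRL queries over the ontology; here plain dictionaries stand in for it. Only the words and classes of Model 4 come from the text, and the function name `match_model` is hypothetical.

```python
# Word -> vocabulary class (entries taken from the "Take me there" example).
VOCABULARY = {
    "take": "words for tracking",
    "me": "personal pronoun",
    "there": "location",
}

# Model name -> ordered sequence of classes the events must match.
FUSION_MODELS = {
    "Model 4": ["words for tracking", "personal pronoun", "location"],
}


def match_model(events):
    """Return the name of the model matched by the events, or None.

    Events outside the vocabulary, or events in the wrong order, are rejected.
    """
    try:
        classes = [VOCABULARY[event.lower()] for event in events]
    except KeyError:
        return None  # an event is not part of the predefined vocabulary
    for name, pattern in FUSION_MODELS.items():
        if classes == pattern:
            return name
    return None


if __name__ == "__main__":
    print(match_model(["Take", "me", "there"]))  # Model 4
    print(match_model(["me", "take", "there"]))  # None (wrong order)
```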
Finally, the time checking allows the verification of time between two successive modalities and the full command time. Indeed, the time between two modalities has to be less than the maximum pre-defined time allowed (5 seconds) and the full time of a command has to be below the maximum allowed pre-defined command time (15 seconds).
We have chosen to introduce this checking mechanism so that the system will not wait indefinitely for commands from the user. By completing these four stages, the system will be able to recognize the events obtained from the environment and make the fusion according to the correct pre-defined model in the ontology.
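The time check can be illustrated with a short Python sketch. The two limits (5 seconds between successive modalities, 15 seconds for the full command) are those given above; the function name and the representation of events as a list of timestamps in seconds are assumptions.

```python
MAX_GAP_S = 5       # maximum time between two successive modalities
MAX_COMMAND_S = 15  # maximum time for the full command


def timing_ok(timestamps):
    """Check a command's event times (seconds, ascending) against both limits."""
    if not timestamps:
        return False  # nothing was detected, so there is no command to accept
    gaps_ok = all(later - earlier < MAX_GAP_S
                  for earlier, later in zip(timestamps, timestamps[1:]))
    total_ok = timestamps[-1] - timestamps[0] < MAX_COMMAND_S
    return gaps_ok and total_ok


if __name__ == "__main__":
    print(timing_ok([0.0, 2.0, 4.0]))           # True
    print(timing_ok([0.0, 6.0]))                # False: gap too long
    print(timing_ok([0.0, 4.0, 8.0, 12.0, 16.0]))  # False: command too long
```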
The fission engine subdivides the action corresponding to the results of the fusion engine into elementary subtasks and sends them to the available output modalities. The subdivision is made according to predefined patterns in the ontology. These basic commands are sent to the output modalities based on their availability. Fission patterns are defined in the ontology in two parts, namely a problem and a solution. The problem is the command issued by the user and obtained from the fusion engine, while the solution is composed of all possible sub-tasks for that command. The output modalities chosen are Screen, Speaker, Wheels and Arm; these are deactivated according to the state of the context. As in the fusion engine, the WWHT model is used for the fission engine. This allows the fission engine to subdivide the results of the fusion engine by answering the questions “What”, “Which”, “How” and “Then”. By answering these questions, the fission process is completed:
“What” – refers to what the system has detected. It will be the result of the fusion engine.
“Which” – the system decides which solution the fission engine should use for subdividing the action corresponding to the fusion result.
“How” – the system will subdivide the desired action using the fission solution selected from the predefined solutions stored in the ontology.
“Then” – the system will select the appropriate output modalities depending on the context information. For instance, if the user is deaf, then the sound modality will be deactivated.
For the fission process, we have defined nine possible solutions in our ontology. These solutions are presented in Table 2. Each solution is a class of the ontology, and its instances are connected using object properties to define their order of execution. For instance, Fission Solution 1 is the solution for Models 1, 6 and 11 of the fusion engine. For example, the request “Give me cup” will be answered by the following sub-tasks in sequential order: Move to object position → Grab object → Move to people position → Drop object.
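To illustrate the problem/solution structure of a fission pattern, the sketch below encodes Fission Solution 1 as plain data. The ontology itself stores this information as classes and object properties; the Python names (`FissionPattern`, `find_fission_solution`) are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FissionPattern:
    problem_models: frozenset  # fusion models (the "problem") answered by this pattern
    sub_tasks: tuple           # ordered elementary sub-tasks (the "solution")


# Only Fission Solution 1 is spelled out in the text; the other eight are omitted.
FISSION_PATTERNS = {
    1: FissionPattern(
        problem_models=frozenset({1, 6, 11}),
        sub_tasks=(
            "Move to object position",
            "Grab object",
            "Move to people position",
            "Drop object",
        ),
    ),
}


def find_fission_solution(fusion_model):
    """Return (solution number, ordered sub-tasks) for a fusion model, or None."""
    for number, pattern in FISSION_PATTERNS.items():
        if fusion_model in pattern.problem_models:
            return number, pattern.sub_tasks
    return None


print(find_fission_solution(11))
```

A tuple is used for the sub-tasks because their order of execution is part of the solution.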
Case study
Here, we present an example of a command issued by the user and show how the system handles it and merges the information using the fusion engine. The scenario chosen to test our fusion engine is “Give Her Some Juice”. In this case, the system has to understand the meaning of the phrase, find a model that corresponds to it, merge the information, and take into account the maximum time allowed between the modalities and the full command time. Then the system has to subdivide the action into elementary sub-tasks and send the command to output modalities using the fission engine. We highlight that when asking for juice, the system has to understand that the juice has to be brought in an appropriate container, such as a glass or a bottle.
After launching the Pellet reasoning engine in Protégé and completing the check with no problems detected, we start the test of the fusion engine. We assume that the system detected the inputs described in Table 3 (answering the question “What”).
When detecting inputs, the fusion engine checks the presence of these inputs in the ontology model. It then yields the classes where these inputs are defined as individuals, as shown in Fig. 7. Back to the cited case, we note that the fusion engine has recognized the words: “give”, “juice”, “her” and the gesture sensor information that detected the position of “her” and gives us the location in different classes of the ontology. The word “ding”, for example, is a sound emitted by the television and is detected by the system but is rejected because it is not found in any class within the ontology.
Example of a user’s command

Vocabulary verification result.
Furthermore, let us assume that the wheelchair user has a hearing problem as a health issue. Assume further that the luminosity level is higher than the minimum acceptable value and that the noise level is less than the maximum acceptable value. The battery is also fully charged and the user is otherwise in good health (blood pressure and temperature are good). In this case, the only modality deactivated is the one affected by the hearing problem, namely the output modality “Speaker”; all the other input and output modalities are available. See Fig. 8.

Output modalities availability result.
For the question “Which”, the system will launch a query to find the corresponding model in the ontology and as per the fusion engine, will find the pre-defined model of the command “Give Her Some Juice” that is represented by the following order: Words convenient objects → People → Liquid.
For the question “How”, the fusion engine will merge the input modalities using the pre-defined models. The fusion engine understood that the liquid requested has to be brought in an object used for liquids. The result of the query is presented in Fig. 9. The fusion engine has recognized the model as being Model 19 of the ontology and that the answer will be of the model: Words convenient objects → People → Object used for liquid → Liquid. In fact, the fusion engine has recognized the model and added the objects: bottle, glass or jug that can be used for the juice.
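The enrichment step described above, in which a requested liquid implies a container, can be sketched as follows. In our system this inference is obtained by querying the ontology; the function `enrich_model` below is a hypothetical stand-in that only reproduces the resulting class sequence.

```python
# Containers named in the text as usable for the juice.
LIQUID_CONTAINERS = ("bottle", "glass", "jug")


def enrich_model(classes):
    """Insert the class "Object used for liquid" before every "Liquid" class."""
    enriched = []
    for cls in classes:
        if cls == "Liquid":
            enriched.append("Object used for liquid")
        enriched.append(cls)
    return enriched


detected = ["Words convenient objects", "People", "Liquid"]
print(" → ".join(enrich_model(detected)))
```

A model that contains no liquid is returned unchanged, so the same step can be applied to every recognized command.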

Fusion result and time verification.
Finally, for the question “Then”, the fusion engine will send the fusion result and the corresponding fission solution after making time verification. The result is given in Fig. 9, where Txy and Tyz are the times between the successive modalities “Give”, “Her” and “Juice” respectively, and Full-Time-C is the time of the full command. We notice that each time between successive modalities is less than 5 seconds (the maximum time allowed), and the full command time is less than 15 seconds (the maximum command time allowed), so the time condition has been satisfied.
Now that the fusion engine has merged the information collected from the environment and has understood the user’s request, the system needs to send this information to the available output modalities to complete the action by using the fission engine. Going back to Fig. 9, when merging the information, the fusion engine concludes that the fission solution corresponding to our example is Fission Solution 7. The result is presented in Fig. 10. The fission result is: move to a position, grab a liquid object, pour the object into a container, move to the person’s position and then drop the object.

Fission result.
When using the WWHT model for the fission engine, the “What” question referred to the information obtained from the fusion engine. Also, as shown in Fig. 9, the fusion engine has not only correctly merged the input modalities but also answered the question “Which” by giving the fission solution that has to be used for the fission process; in our case, it is Fission Solution 7.
To answer the question “How”, the system has to find the corresponding fission solution and run it. We have defined nine solutions, presented in Table 2, that describe fission examples according to the fusion process. The fission engine will subdivide the result using a predefined solution defined in the ontology. Each solution has its own query defined in the SWRL tab of Protégé. When executing this query, the system subdivides the results of the fusion engine into elementary sub-tasks. The result is presented in Fig. 10. For instance, the solution for the example “Give her juice” is the sequence reported above for Fission Solution 7: move to a position, grab a liquid object, pour the object into a container, move to the person’s position and then drop the object.
For the question “Then”, the fission engine checks the context by using an SQWRL query that selects the output modalities and decides which output modality can be used according to the state of the context. Recall that we assume that the wheelchair user has only hearing difficulties and all the other values are in the accepted ranges. In this case, only the vocal output modality (speaker) will be deactivated. The SQWRL query used is as follows:
The result is presented in Fig. 8. The selection of output modalities is done by checking the state of the context. The system continuously retrieves context information from the environment and compares it with the modalities’ selection conditions in the ontology. If a modality deactivation condition is met, the system will deactivate the corresponding output modality. Finally, by answering the four questions cited earlier, the system is able to send the subdivided tasks to the corresponding output actuators. This means that the fusion and the fission processes are completed and the user request “Give her juice” is answered.
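The selection of output modalities from the context can be sketched as below. This is not the SQWRL query used in our ontology; it is a hypothetical Python illustration of the same rule-based idea, in which the context key `handicap` and the rule list are assumptions, and only the hearing rule comes from the text.

```python
OUTPUT_MODALITIES = ("Screen", "Speaker", "Wheels", "Arm")

# Each rule pairs a condition on the context with the output modality it deactivates.
# Only the hearing rule is taken from the text; further rules would be added here.
DEACTIVATION_RULES = [
    (lambda context: context.get("handicap") == "hearing", "Speaker"),
]


def available_outputs(context):
    """Return the output modalities that remain active for the given context."""
    deactivated = {
        modality for condition, modality in DEACTIVATION_RULES if condition(context)
    }
    return [m for m in OUTPUT_MODALITIES if m not in deactivated]


print(available_outputs({"handicap": "hearing"}))
```

Keeping the rules in a list mirrors the way the deactivation conditions are stored in the knowledge base rather than hard-coded in the engine.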
We conclude that all the conditions are satisfied. First, the semantic condition was satisfied. Then, by answering the questions of the WWHT model, the vocabulary check, the modality selection and the order of the events have been satisfied. Finally, the time verification has been satisfied. For these reasons the fusion has been done by merging the detected events. The fission engine has subdivided the fusion result into elementary sub-tasks using the pre-defined solution stored in the ontology. With that, the fission process is completed and the user request “Give her some Juice” is answered by using the command “Give Her a Glass of Juice”.
In this section, the detailed validation of our multimodal human–machine interaction system architecture is presented.
The importance of system validation
The aim of our paper is to develop a theoretical foundation for ambient human–machine–environment interaction in order to formally validate our proposed architecture. The formal validation proves that our method is effective in the simulation environment as well as in a real environment. The model of the robotic environment is integrated into the ontology and the specialized concepts of Agent, Role, and Activities are also defined. All instances called “facts”, both from concepts and events, in the case of a real robotic environment, are stored in the ontology. In a real robotic environment, it is easy to program the services of the concentrator to deal with the sensors and actuators used in the desired robotic application.
However, in an interactive architecture, like the one proposed in this paper, one can also accurately receive the necessary inputs and provide the necessary outputs using a concentrator as is proven in our previous work [8]. What remains to be validated are the functionalities of the important processes of our architecture which are the fusion and fission using ontology. This is the aim of this paper: to prove the reliability and suitability of our architecture for real robotic applications. This is done via:
The use of OWL – Ontology formalism allows us to verify the consistency of knowledge representation of the robotic environment.
The instance checking is realised using Protégé for real robotic application.
The mechanism of reasoning is checked using the rules of fusion and fission.
The use of the stochastic Colored Petri Nets allows us to model processing time, following some laws of probability.
We execute the applications in several scenarios of different models in the ontology using different sequences of events generated and by checking the recognition of composite events related to existing models.
The simulated environment
The validation of our work is essential to prove that the functional requirements of the system have been achieved. To this end, we have chosen to model our system using CPN Tools for colored Petri nets [12]. The execution of a Petri net is non-deterministic. Because of this non-deterministic nature, and because multiple transitions and events may take place at any given time, Petri nets are well suited for modeling the concurrent behavior of distributed systems such as ours.
This tool allows modeling and visualization of architectures using the formalism of Colored Petri Nets. To understand the mechanism of our fusion and fission engines, we will use three scenarios of potential user’s requests and explore the behavior of our architecture. These scenarios are:
Words for tracking → Personal pronoun → Location (take me there), Words convenient object → People → Convenient object (bring me spoon), and Words convenient objects → People → Liquid (give her juice),
which are Models 4, 11 and 19 respectively in the fusion models and have Solutions 1, 4 and 11 for the fission process. Figure 11 shows the general view of the architecture.

Multimodal system architecture, a general view.
The Colored Petri Net of our system, presented in Fig. 11, is composed of ten essential modules:
Input Modality: this module is responsible for sending a random combination of words to the system. This combination will be a command issued by a user. As shown, we have defined nine words: bring, give, take, her, me, ding, spoon, juice and there. By sending the words randomly, we can have both a correct combination and a combination that does not mean anything. This ensures that the system recognizes the correct models and, for a wrong combination, sends feedback and rejects it.
Ontology: this is the knowledge base of the system defined earlier in this paper.
Vocabulary checker: This module will go back to the ontology and check the presence of the received vocabulary. If the word is found, this module will retrieve its model from the ontology. Then it will send the model along with the word to other modules. The result will be sent to the ontology to proceed with context verification and modality selection.
Time check: this module checks the time between modalities and the command time. If the time is accepted the process will go on, otherwise the process will stop and a feedback will be sent.
Model building: when the word has been recognized by the vocabulary checker, the name of its class has been added. This module will be responsible for linking the names of the words’ classes for each example. For instance, we can obtain the combination “words for tracking, person and location”. This is done so that the system can compare them to the models defined in the ontology before the fusion process.
Model recognized: This module will search in the ontology for a model. If a model is recognized, it will be sent further in the architecture; otherwise a feedback will be sent.
Fusion engine: this module is responsible for the fusion. In fact, the input modalities will be merged according to models from the ontology.
Fusion result and fission patterns: This module will send the result of the fusion engine alongside its corresponding fission solution.
Fission engine: this module will subdivide the fusion result into elementary sub-tasks according to fission solutions obtained from the ontology.
Output context verification: this module will check the availability of output modalities according to the state of the context and send the final answer to output actuators.
The Input modality module is shown in Fig. 12. As shown, we have nine words in the place “Input” that will be sent randomly to form a random example. The example will be formed by three words. This model building is limited by the number mMax. Also when sending a word to build an example, a time is randomly generated and associated with each word. The time between the first and the third word will be the full command time and will be checked later with the modality time. The word, the number of the example and the time will be sent to the place “Vocabulary verification”. This place will send the information to compare it with the knowledge base stored in the ontology.
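The behavior of the Input modality module can be approximated outside CPN Tools with the sketch below. It is an assumption-laden illustration: the function `generate_example`, the parameter `max_gap_s` and the uniform distribution of the delays are hypothetical, and words are drawn with replacement, which may differ from the token handling in the Petri net.

```python
import random

# The nine words available in the place "Input".
INPUT_WORDS = ["bring", "give", "take", "her", "me", "ding", "spoon", "juice", "there"]


def generate_example(rng, max_gap_s=8.0):
    """Build a random three-word example as a list of (word, arrival time in seconds)."""
    arrival = 0.0
    example = []
    for position in range(3):
        if position > 0:
            # Random delay since the previous word; it may exceed the allowed maximum.
            arrival += rng.uniform(0.0, max_gap_s)
        example.append((rng.choice(INPUT_WORDS), arrival))
    return example


rng = random.Random(42)  # fixed seed so that a run can be repeated
for word, arrival in generate_example(rng):
    print(f"{word:>6} at {arrival:5.2f} s")
```

Because the delays are allowed to exceed the thresholds, such a generator produces examples that exercise the time rejection paths as well as the accepted ones.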

Input modality transition.
Indeed, as shown in Fig. 13, eight possible words from the input are declared alongside their membership class in the ontology. For instance, the places “wco1” and “wco2” are for the class “words convenient object” and contain the words “Bring” and “give” respectively. The place “wft” is for the class “words for tracking” and contains the word “Take”.

Ontology part responsible for the vocabulary check.
The place “L” is for “Location” and contains the word “There”. “PP” and “PS” are for “Personal pronoun” and “Persons” and have the words “me” and “her” respectively. Finally, the places “co” and “lq” are for “convenient object” and “liquids” and contain the words “spoon” and “juice” respectively.
Hence, when an example has been formed and sent from the input, the system will have to check the ontology for the defined vocabulary and compare it with the example. The system will compare the word from the example with the word of the different places declared earlier. If the word matches one of the places, the name of the class will be added to the word and sent for further treatment. Otherwise the word will continue to be checked until the right class is found. We highlight that the word “Ding” declared in the input does not have a meaning (it can be a sound emitted by a television and detected by the system). So, if an example contains the word “Ding”, it will be rejected. The place “Vocabulary from ontology” will send the stored vocabulary to the module “Vocabulary checker” shown in Fig. 14 to make the comparison.

Vocabulary checker transition.
If a word in a given example is not found in the ontology, for instance the word “Ding”, this word will be rejected and sent to the place “Rejected”. This will induce the rejection of the other words that formed that example to the place “Rejected Exp” because the example will not have a meaning with two words since, as we defined previously, an example is composed of three words. Then, the result of the recognized example will be sent to the place “Context verification”, which is linked to the ontology to check the state of the context and select the available input modalities.
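The vocabulary check and the rejection of a whole example can be sketched as follows. The word-to-class table reproduces the places of Fig. 13 as described in the text; the function name `check_vocabulary` and the use of `None` to signal rejection are hypothetical choices for this illustration.

```python
# Membership class of each of the eight words declared in the ontology.
VOCABULARY = {
    "bring": "words convenient object",
    "give": "words convenient object",
    "take": "words for tracking",
    "there": "location",
    "me": "personal pronoun",
    "her": "persons",
    "spoon": "convenient object",
    "juice": "liquids",
}


def check_vocabulary(example):
    """Tag each word with its class, or return None if any word is unknown.

    A single unknown word (such as "ding") rejects the whole example,
    since the remaining two words no longer form a command.
    """
    tagged = []
    for word in example:
        word_class = VOCABULARY.get(word.lower())
        if word_class is None:
            return None
        tagged.append((word, word_class))
    return tagged


print(check_vocabulary(["bring", "me", "spoon"]))
print(check_vocabulary(["Ding", "her", "spoon"]))
```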
The part of the ontology module responsible for the modality selection is presented in Fig. 15. It shows the context definition and the modality selection. Here, we have defined the place “H” for the handicap type, the place “Alr” for the alarm, the place “LL” for light level and the place “NL” for the noise level. The corresponding places for the output context are respectively “HF”, “AlrF”, “LLF” and “NLF”. Based on the values of this context information, the input and output modalities will be selected. When an example arrives at the place “context verification”, the context information will be retrieved from the ontology and checked. For instance, if the detected command is a word (voice) and the noise level retrieved from the ontology is higher than the accepted value, the voice modality will be disabled and the example rejected. For the purpose of this validation, we have chosen to define all the information in the accepted range so as to focus on the execution of the fusion and the fission engines.

Ontology part responsible for input and output modality selection according to the context.
After the selection of modalities and the vocabulary verification, the system has to check the time of modalities and the full command time.
This check is done so that the system will not wait indefinitely for an input to arrive. If the time between two modalities is longer than 5 seconds, or the full command time is longer than 15 seconds, then the example will be rejected and sent to the places “MTime Rejected” and “CTime Rejected”, respectively. The module responsible for this is presented in Fig. 16.

Time verification.

Model building.
When an example arrives, the time of the modalities will be retrieved and stored so that the system can make a subtraction between two successive words for modality time and between the third and the first word for the full command time. If the time is accepted, the example will be sent to the output place of this module, the “Ontology Model”, for further treatment. Figure 17 shows the module “Model building”. In this validation, for the system to be able to recognize a model, it needs to compare the model of the example with the ones stored in the ontology, knowing that when a word was recognized in the “Vocabulary recognition” module, its membership class was provided. The “Model building” part will select the class of each word of an example and associate it with the next word’s class of the same example to form a model, for example “wco p co” (words convenient objects, people, convenient object) and compare it with the predefined models of the ontology, as shown in Fig. 18. When the model is recognized, the example will be sent to the fusion engine in Fig. 19 to merge the information and conclude the corresponding fission solution.
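The model building and recognition steps can be sketched as an ordered lookup. The three entries below are the scenario models listed earlier in this section, written with the class names used in that list; the dictionary `FUSION_MODELS` and the function `recognize_model` are hypothetical names, and the input is assumed to be already expressed with those class names.

```python
# Ordered class sequences of the three scenario models and their model numbers.
FUSION_MODELS = {
    ("words for tracking", "personal pronoun", "location"): 4,
    ("words convenient object", "people", "convenient object"): 11,
    ("words convenient object", "people", "liquid"): 19,
}


def recognize_model(word_classes):
    """Return the model number matching the ordered class sequence, or None."""
    key = tuple(name.lower() for name in word_classes)
    return FUSION_MODELS.get(key)


print(recognize_model(["Words for tracking", "Personal pronoun", "Location"]))
print(recognize_model(["Location", "Personal pronoun", "Words for tracking"]))
```

Using a tuple as the key makes the order of the events part of the match, so the same classes in a different order are not recognized.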

Ontology part for fusion models definition.

Fusion engine.
Now that the system has recognized the model and the fusion was successful, the result of the fusion engine has to be subdivided using the fission engine. The system knows the fission solution that corresponds to the fusion process. Accordingly, the system has to refer to the ontology and retrieve the correct fission solution stored in it, as shown in Fig. 20.

Ontology part of fission solutions definition.
Figure 21 shows the fission solutions of our three scenarios stored in the ontology. As stated earlier in the fusion engine, if the user requests a liquid, the system understands that an object used to hold liquid is important and has to be considered when fusing the inputs.

Fission engine.
Moreover, in the fission process, the state of the door is defined in the ontology. In fact, when a request includes moving from one room to another through a door, we have to acknowledge that the wheelchair has to open the door, or just pass through it if it is already open. If we do not include this, the robot will stop because the closed door will be detected as a barrier. For instance, if the request is “bring me spoon”, knowing that the wheelchair user is in the living room and the spoon is in the kitchen with a closed door, the fission solution will be Solution 4, that is: “move to object position”, “open door”, “grab object”, “move to people position”, “drop object”.
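The effect of the door state on the sub-task list can be sketched as below. The position of “open door” in the sequence follows Solution 4 as quoted above; the function `plan_with_door` and the string values of `door_state` are hypothetical, and the ontology-based selection of the solution is not reproduced here.

```python
BASE_SUB_TASKS = [
    "move to object position",
    "grab object",
    "move to people position",
    "drop object",
]


def plan_with_door(door_state):
    """Return the sub-task list, adding "open door" only when the door is closed."""
    plan = list(BASE_SUB_TASKS)  # copy so the base sequence is never modified
    if door_state == "closed":
        # Same position as in Solution 4: right after the first move.
        plan.insert(1, "open door")
    return plan


print(plan_with_door("closed"))
print(plan_with_door("open"))
```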
In Fig. 21, we find the fission engine that provides the final fission solution and the type of the given solution. The types are: “Moving” if the request includes moving from one place to another, “Visual” if the request includes a screen answer or “Vocal” if the answer demands the use of a speaker. Finally, the module output context verification will select the available output modality according to the output context state presented in Fig. 15, and send the final result to the place Command as shown in Fig. 22.

Rejected vocabulary.
We have launched the simulation for three random examples and the answer is as follows: (Number of example, position of the word in the example, the fusion result, type of output, Fission solution number, Sub-task from the fission). For the first example, the simulation generated the example “Ding her spoon” which is not a valid command, so the system has rejected this example as shown in Fig. 22. The simulation generated “bring me cup juice” for the second example and “bring me spoon” for the third. As shown in Fig. 23, our architecture has successfully merged the input modalities and subdivided the results into elementary sub-tasks after checking all the conditions.

Final result.
Furthermore, to visualize the behavior of our architecture with more inputs, we generated 33 examples. The results of the first and the second simulation are shown in Fig. 24(a) and 24(b) respectively. In Fig. 24(a), five examples have been accepted and processed by the system. Ten examples have been rejected because the model did not match a model defined in the ontology. Seven and four examples have been rejected because the command time and the modality times exceeded the maximum command and modality times respectively.

(a) First simulation results. (b) Second simulation results.
Finally, seven examples were rejected because the vocabulary used was not defined in the ontology, i.e. ‘Ding’. For the second simulation (Fig. 24(b)), eleven examples have been accepted. Six examples were rejected because they did not match any predefined model of the ontology. Also, three and six examples have been rejected because of the command time and the modality time. And finally seven examples have been rejected because the vocabulary did not match the predefined vocabulary of the ontology.
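As a consistency check on the figures reported above, the outcome counts of both simulations can be tallied. The counts are those given in the text; the dictionary keys and the function `acceptance_rate` are hypothetical names.

```python
SIMULATION_1 = {
    "accepted": 5,
    "model_rejected": 10,
    "command_time_rejected": 7,
    "modality_time_rejected": 4,
    "vocabulary_rejected": 7,
}
SIMULATION_2 = {
    "accepted": 11,
    "model_rejected": 6,
    "command_time_rejected": 3,
    "modality_time_rejected": 6,
    "vocabulary_rejected": 7,
}


def acceptance_rate(results):
    """Fraction of generated examples that were accepted and processed."""
    return results["accepted"] / sum(results.values())


for name, results in (("first", SIMULATION_1), ("second", SIMULATION_2)):
    print(f"{name}: {sum(results.values())} examples, "
          f"{acceptance_rate(results):.1%} accepted")
```

In both runs the five outcome counts add up to the 33 generated examples, with about 15% of the examples accepted in the first simulation and about 33% in the second.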
Based on these simulations, we conclude that our system is able to understand input modalities, reject commands that do not match the pre-conditions and assure the fusion and the fission process. We have demonstrated that our proposed architecture is an autonomous system that is able to understand the environment and decide by answering the user’s request using the fusion and the fission engines while taking into consideration the context and using the concept of ontology to build a knowledge base.
In this paper, we have presented our multimodal architecture, designed to react to the commands of a user of a wheelchair equipped with a manipulator arm. This architecture is for the interaction between the robot (wheelchair) and the user. To validate our approach on a real-world robot, we have chosen to apply a simple scenario using the robot Nao. Nao is a humanoid and not a wheelchair. We have decided to validate our approach by using this robot because it is already functional in our laboratory. The wheelchair robot is an ongoing work of another laboratory team and is still under constant evolution.
The validation using Nao also proves that our architecture, initially designed for a wheelchair robot, can be adapted to any kind of robot for the purpose of human–machine assistance interaction.
Technical description
To undertake some scenarios, we use wrappers and a driver that drives the hardware. The driver reads and integrates network events to act on the hardware and read from it. The wrappers take several data and transform them into one or multiple events. The network is able to handle events that have a predicate form with arguments (move (object, x, y)).
A wrapper has a model list. For an input example, the wrapper adds sensor data to the event model and sends the event to the fusion agents. For the output example, the wrapper receives an event from an agent (fission agents), retrieves data from the event to drive actuators, and sends a text or voice message to the actor(s) involved, directly or through network communication. See Fig. 25.
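The predicate form of the events exchanged on the network can be sketched with a small data class. This is an illustrative assumption about the data layout, not the actual wrapper code: the names `Event` and `wrap_position_reading` are hypothetical, and only the form move(object, x, y) comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    predicate: str    # name of the predicate, e.g. "move"
    arguments: tuple  # ordered arguments of the predicate

    def __str__(self):
        return f"{self.predicate}({', '.join(str(a) for a in self.arguments)})"


def wrap_position_reading(object_name, x, y):
    """Input wrapper: turn raw sensor data into one event in predicate form."""
    return Event("move", (object_name, x, y))


event = wrap_position_reading("cup", 1, 2)
print(event)
```

An output wrapper would perform the reverse step, reading `predicate` and `arguments` from a received event to drive the corresponding actuator.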
Scenario
Here, we present a scenario in which the robot reminds a person to take his medication.
The task of the robot Nao is to look after an elderly person named Amar. The robot receives the coordinates of the person’s position. It interprets the person’s speech and gestures through Kinect, hence being able to recognize the user’s state (i.e. active, sleeping, watching TV, moving the arms, walking, standing, talking). See Fig. 26.

Relation between architecture’s elements.

Kinect: visual and audio sensor. Nao: gestural and vocal actuator.
Here, the Kinect used provides speech recognition events as well as gesture and object recognition. The wrapper runs on the computer, which is connected to the Kinect via USB; it uses the Ethernet network to communicate events to the agents that are responsible for the fusion and fission processes.
The robot Nao receives events via Wi-Fi (Ethernet network). A wrapper found inside the robot uses the NaoQi broker developed by Aldebaran to give orders to the robot (voice synthesis, gestures, walk, stop, object recognition and face recognition). The robot also accesses the medical records (e.g. illnesses, such as diabetes) stored in the computer of the patient through Wi-Fi connection. See Fig. 27.
Nao is also capable of obtaining other pertinent information about the user using other forms of modalities. For example, the person’s heartbeat can be obtained via LG Urban watch attached to the user’s wrist. Through the person’s Google Agenda calendar, one would be able to get the time when the sick person must take his medication. It is given that the medical diary is pre-filled by competent medical personnel. Hence, let us assume that the robot knows that Amar has to take medication at 10:00 a.m. and notices that Amar is lying on the armchair in a sleeping state.
Given all this input information, the fusion engine combines it and yields a desired action: remind Amar to take his medication. The fission engine sends the result to Nao, which will then send a vocal reminder to Amar. If Amar does not answer and the Kinect indicates that the subject’s state is “sleeping”, Nao considers four possible explanations: (i) Amar is awake but deaf; (ii) Amar is dozing; (iii) Amar is in deep sleep; or (iv) Amar is in an emergency state and needs medical help.
The robot stays in its position and utters “Amar, it is time to take your medication”. If there is no reaction from Amar, the robot tries again with another modality which is the gesture by moving its arm to make a sign to Amar (see Fig. 28). If Amar is still not responding to this, then the system gets a new context (i.e. a new situation), sends the new input information to the fusion engine. The new context that needs new action is then handled by the fission engine. As such, the robot approaches Amar’s position. Here, Nao will try to wake him up by touching his leg (Fig. 29).
The next scenario will depend on the person’s responses to the robot’s action. If the subject wakes up, then Nao will remind him to take his medication via voice message. If the subject has no reaction, the robot sends an emergency message to the medical personnel (through the network) asking him/her to come and see what has happened.
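The escalation of modalities in this scenario can be summarized by the sketch below. It is a simplified illustration of the sequence described above and does not use the NaoQi API: the step labels, the function `run_reminder` and the callback `reacts_to` (which stands for the reaction detected through the Kinect) are all hypothetical.

```python
ESCALATION_STEPS = ["vocal reminder", "arm gesture", "approach and touch leg"]


def run_reminder(reacts_to):
    """Return the ordered list of actions performed by the robot.

    `reacts_to(step)` must return True when the person reacts to that step.
    """
    actions = []
    for step in ESCALATION_STEPS:
        actions.append(step)
        if reacts_to(step):
            if step == ESCALATION_STEPS[-1]:
                # The person was asleep: repeat the reminder by voice once awake.
                actions.append("vocal reminder")
            return actions
    # No reaction to any modality: alert the medical personnel.
    actions.append("send emergency message")
    return actions


print(run_reminder(lambda step: step == "arm gesture"))
print(run_reminder(lambda step: False))
```

Each unanswered step corresponds to a new context that is sent back to the fusion engine, which is why the escalation is modeled as a sequence rather than a single action.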
In this paper, we have presented our multimodal human–machine interaction system. The system makes the interaction easier by allowing the user to convey information in a natural way, through speech and gesture. The multimodal fusion and fission engines are intelligent, being aware of the user’s contextual information. The intelligence of the fusion and fission engines is based on the mechanism of pre-conditions stored in the ontology; there are event models that are stored in the knowledge base and the reaction of the system is based upon these models.

Events communication.
Ontology is used as a tool to represent knowledge. It allows full description of the human–machine environment and assures a common understanding of its entities, assuring the reusability of the knowledge stored in the system whenever it is needed. The ontology was defined using classes, properties and individuals that described the environment of our case study and is open for adaptation to any other example by adding or retrieving information from it. Furthermore, we have adapted the WWHT model to handle human–machine interaction based on action modalities, such as moving a wheelchair and actuating a robotic arm.

Nao using gesture to inform Amar that he has to take his medications.

Nao touching the subject to wake him up.
We validated our approach using formal specifications with CPN Tools for colored Petri nets. The execution of a Petri net is non-deterministic. Because of this, and because multiple transitions may take place at any given time, Petri nets are well suited for modeling the concurrent behavior of distributed systems. The simulation using Petri nets proved that the system is able to recognize input events, respond to a user’s request using different modules and send the unimodal tasks to output modalities. The simulation also demonstrated that our system is able to detect a false input, reject the commands that it does not understand, and reject an event if the time involved is not acceptable. Hence, by using colored Petri nets, we were able to model and validate the temporal requirements of our system in addition to confirming the attainment of the system’s functional requirements.
For the purpose of validating this architecture on a real-world robot application, we made use of the robot Nao along with Kinect for testing a specific scenario. The fusion engine was able to merge all inputs received from the environment. The fission engine was able to send Nao different actions to do in this scenario, basically centered on reminding the subject to take his medication. With suitable adaptation, our architecture may be used for various kinds of assistance to handicapped or elderly people. We believe that our proposed architecture is a worthy contribution to the advancement of research in human–machine interaction intended to assist the needy.
