diff --git a/voice interaction drafts/paArchitecture/paArchitecture-1-3.htm b/voice interaction drafts/paArchitecture/paArchitecture-1-3.htm index 7a18263..0e2aad0 100644 --- a/voice interaction drafts/paArchitecture/paArchitecture-1-3.htm +++ b/voice interaction drafts/paArchitecture/paArchitecture-1-3.htm @@ -1,639 +1,1164 @@ - + - Intelligent Personal Assistant Architecture - - +Intelligent Personal Assistant Architecture + +
-

- W3C

+

+ W3C +

-

Intelligent Personal Assistant Architecture

-

Architecture and Potential for Standardization Version 1.3

+

Intelligent + Personal Assistant Architecture

+

Architecture and + Potential for Standardization Version 1.3

Latest version
-
Last modified: March 21, 2023 https://github.com/w3c/voiceinteraction/blob/master/voice%20interaction%20drafts/paArchitecture/paArchitecture-1-3.htm (GitHub repository)
-
HTML rendered version
+
+ Last modified: March 21, 2023 https://github.com/w3c/voiceinteraction/blob/master/voice%20interaction%20drafts/paArchitecture/paArchitecture-1-3.htm + (GitHub repository) +
+
+ HTML + rendered version +
Editors
-
Dirk Schnelle-Walka, modality.ai
- Deborah Dahl, Conversational Technologies
+
+ Dirk Schnelle-Walka
Deborah Dahl, Conversational + Technologies +
- - -
+ +

Abstract

-

This document describes a general architecture of Intelligent Personal - Assistants and explores the potential for standardization. It is meant - to be a first structured exploration of Intelligent Personal Assistants - by identifying the components and their tasks. Subsequent work is - expected to detail the interaction among the identified components and - how they ought to perform their task as well as their actual tasks - respectively. This document may need to be updated if any changes - result of that detailing work. - It extends and refines the description of the previous versions - Architecture and Potential for Standardization Version 1.2. - The changes primarily consist of clarifications and additional - architectural details in new and expanded figures, include input and - output data paths. +

+ This document describes a general architecture of Intelligent Personal Assistants and explores the potential for standardization. It is meant to be a first structured exploration of Intelligent Personal Assistants, identifying the components and their tasks. Subsequent work is expected to detail the interactions among the identified components, their individual tasks, and how they ought to perform them. This document may need to be updated if any changes result from that detailing work. It extends and refines the description of the previous version, Architecture and Potential for Standardization Version 1.2. The changes primarily consist of clarifications and additional architectural details in new and expanded figures, including input and output data paths.

Status of This Document

-

This specification was published by the - Voice Interaction Community Group. - It is not a W3C Standard nor is it on the W3C Standards Track. - Please note that under the - W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

-

Comments should be sent to the Voice Interaction Community Group public mailing list (public-voiceinteraction@w3.org), archived at https://lists.w3.org/Archives/Public/public-voiceinteraction

- -

Table of Contents

-
    -
  1. Introduction
  2. -
  3. Problem Statement
  4. -
  5. Architecture -
      -
    1. Client Layer -
    2. Dialog Layer
    3. -
    4. External Data / Services / IPA Providers Layer
    5. -
  6. -
  7. Error Handling
  8. -
  9. Use Case Walk Through
  10. -
  11. Potential for Standardization
  12. -
  13. Footnotes -
  14. Appendix -
      -
    1. Acknowledgments
    2. -
    3. Abbreviations
    4. -
  15. -
- - -

1. Introduction

-

Intelligent Personal Assistants (IPAs) are now available in our daily lives through our smart phones. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and - many more are helping us with various tasks, like shopping, playing music, setting a schedule, sending messages, and offering answers to simple questions. Additionally, we equip our households - with smart speakers like Amazon’s Alexa or Google Home which are available without the need to pick up explicit devices for these sorts of tasks or even control household appliances in our homes. - As of today, there is no interoperability among the available IPA providers. Especially for exchanging learned user behaviors this is unlikely to happen at all.

-

- Furthermore, in addition to these general-purpose assistants, there are also specialized virtual assistants which are able to provide their users with in-depth information which is specific to an enterprise, government agency, school, or other organization. - They may also have the ability to perform transactions on behalf of their users, such as purchasing items, paying bills, or making reservations. Because of the breadth of possibilities for these specialized assistants, it is imperative that they be able to - interoperate with the general-purpose assistants. Without this kind of interoperability, enterprise developers will need to re-implement their intelligent assistants for each major generic platform. -

- -

This document is a first step in our strategy for IPA standardization. It describes a general architecture of IPAs and explores the potential areas for standardization. It focuses on voice as the major input modality. - We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best - practices and seek adjacencies for creation or investment. - The overall concept is not restricted to voice but also covers purely text based interactions with so-called chatbots as well as interaction using multiple modalities. - Conceptually, the authors also define executing actions in the user's environment, like turning on the light, as a modality. - This means that components that deal with speech recognition, natural language understanding or speech synthesis will not necessarily be available in these deployments. In case of chatbots, speech components will be omitted. In case of - multimodal interaction, interaction modalities may be extended by components to recognize input from the respective modality, transform it into something meaningful and vice-versa to generate output - in one or more modalities. Some modalities may be used as output-only, like turning on the light, while other modalities may be used as input-only, like touch.

- -

2. Problem Statement

- -

Currently, users are mainly using the IPA Provider that is shipped with a certain piece of hardware. Thus, selection of a smart phone manufacturer actually determines which IPA implementation - they are using. Switching among different IPA providers also involves switching the manufacturer, which requires high costs and getting used to a new user interface specific to - the new manufacturer. - On the one hand users should have more freedom in selecting the IPA implementation they want. However, they are bound to use the service that is available in that implementation but which may not be what they necessarily prefer. - On the other hand, IPA providers, which mainly produce the software, must also function as hardware manufacturers to be successful.

-

Moreover, we are also seeing the emergence of independent conversational agents, owned and operated by independent enterprises, and built on either white label platforms or of best-of-breed components by 3rd party development agencies. This may largely free IPA development from hardware. Such a market transition creates an ever greater impetus for this work. -

-

Finally, manufacturers also have to take care to port - existing services to their platform. Standardization would clearly lower the needed efforts for porting and thus reduce costs. Additionally, it may also pave the way for interoperability among available IPA providers. - Tasks may be transferred, partially or completely to other IPAs.

- -

In order to explore the potential for standardization, a typical usage scenario is described in the following section.

- -

2.1 Use Cases

-

This section describes potential usages of IPAs.

- -

2.1.1 Travel Planning

-

A user would like to plan a trip to an international conference and she needs visa information and airline reservations. She will give the intelligent personal assistant (IPA) her - visa information (her citizenship, where she is going, purpose of travel, etc.) and it will respond by telling her the documentation she needs, how long the process will take - and what the cost will be. This may require the personal assistant to consult with an auxiliary web service or another personal assistant that knows about visas.

- -

Once the user has found out about the visa, she tells the IPA that she wants - to make airline reservations. She specifies her dates of travel and airline preferences and the IPA then interacts with her to find appropriate flights.

- -

A similar process will be repeated if the user wants to book a hotel, find - a rental car, or find out about local attractions in the destination city. - Booking a hotel as part of attending a conference could also involve finding out about a designated conference hotel or special conference rates, which, again, could require interaction with the hotel or the conference's IPA's.

- -

2.1.2 Emergency Events

-

User encounters emergency situations that requires them to use their hands while administering medical care, driving or operating machinery. Manual interactions on control panels, keyboards or touch pads can impede life saving activities and diminish focus while operating sensitive vehicles, devices and machinery. User would benefit from a secure, interoperable, voice interactive system that can be used to access necessary information, keeping hands free to perform these actions.

- -

Examples of emergency applications include:

- - - -

All of these use cases benefit from voice interaction systems that have:

- - - -

Interoperability:

- - - -

2.2 Roles and Responsibilities

- -

The following roles and responsibilities following the RACI - (responsible, accountable, consulted, informed) are identified

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RoleRACI
Platform providerxx
Content Ownerxx
Developerxx
Designer and Application Developerx
System Integratorx
User
+

+ This specification was published by the Voice + Interaction Community Group. It is not a W3C Standard + nor is it on the W3C Standards Track. Please note that under + the W3C Community + Contributor License Agreement (CLA) there is a limited + opt-out and other conditions apply. Learn more about W3C Community and + Business Groups. +

+

+ Comments should be sent to the Voice Interaction Community Group + public mailing list (public-voiceinteraction@w3.org), archived + at https://lists.w3.org/Archives/Public/public-voiceinteraction +

-
-
Platform provider
-
Accountable and responsible for the operative performance of the - infrastructure (uptime, security, performance as measured - against service-level agreements (SLAs) with clients, customers, - and partners, inclusive of on-premises hardware and cloud - services.
-
Content Owner
-
Accountable for the UX, content, and operational performance of - any and all assistants that represent the brand and its services - to brand constituents (including clients, customers, and - internal stakeholders).
- Example: a financial services enterprise, such as - a bank
-
Developer
-
Responsible to the content owner for the - - Example: Most often, an independent enterprise specializing in - conversational assistance.
-
Designer and Application Developer
-
Responsible to the content owner for - -
-
System Integrator
-
Responsible to content owner for - -
-
User
-
Uses the IPA
-
- -

3. Architecture

- -

In order to cope with such use cases as those described above an IPA follows the general design concepts of a voice user interface, as can be seen in Figure 1.

- -

The architecture described in this document follows the SOLID principle - introduced by Robert C. Martin to arrive at a scalable, understandable and reusable software solution.

-
-
Single responsibility principle
-
The components should have only one clearly-defined responsibility.
-
Open closed principle
-
Components should be open for extension, but closed for modification.
-
Liskov substitution principle
-
Components may be replaced without impacts onto the basic system behavior.
-
Interface segregation principle
-
Many specific interfaces are better than one general-purpose interface.
-
Dependency inversion principle
-
High-level components should not depend on low-level components. Both should depend on their interfaces.
-
- -
- Basic IPA Architecture -
Fig. 1 Basic architecture of an IPA
-
-

- This architecture follows a traditional partitioning of conversational systems, with separate components for speech recognition, natural language understanding, dialog management, natural language generation, and audio output, (audio files or text to speech). This architecture does not rule out combining some of these components in specific systems. -

- -

This architecture aims at serving, among others, the following most popular high-level use cases for IPAs

-
    -
  1. Question Answering or Information Retrieval
  2. -
  3. Executing local and/or remote services to accomplish tasks
  4. -
-

This is supported by a flexible architecture that supports dynamically adding local and remote services or knowledge sources such as data providers. Moreover, it is possible - to include other IPAs, with the same architecture, and forward requests to them, similar to the principle of a russian doll (omitting the Client Layer). - All this describes the capabilities of the IPA. These extensions may be selected from a - standardized marketplace. For the reminder of this document, we consider an IPA that is extendable via such a marketplace.

- -

Not all components may be needed for actual implementations, some may be omitted completely. However, we note them here to provide a more complete picture. - This architecture comprises three layers that are detailed in the following sections

-
    -
  1. Client Layer
  2. -
  3. Dialog Layer
  4. -
  5. External Data / Services / IPA Providers
  6. -
-

Actual implementations may want to distinguish more than these layers. The assignment to the layers is not considered to be strict so that some of the components may be shifted - to other layers as needed. This view only reflects a view that the Community Group regard as ideal and to show the intended separation of concerns.

- - -

3.1 Client Layer

-

The Client Layer contains the main components that interface with the user. The following figure details the view onto the Client Layer shown in Figure 1.

- - -

3.1.1 Capture

- -

Capture devices or modality recognizers are used to capture mutlimodal user input, such as voice or text input. Additional input modalities can be - employed that capture input with a specific modality recognizers. - Additional input may be gathered from Local Data Providers

- -
3.1.1.1 Microphone
-

The microphone is used to capture the voice input of a user as a primary input modality.

+

Table of Contents

+
    +
  1. Introduction
  2. +
  3. Problem Statement
  4. +
  5. Architecture +
      +
    1. Client Layer +
    2. Dialog Layer
    3. +
    4. External Data / + Services / IPA Providers Layer
    5. +
  6. +
  7. Error Handling
  8. +
  9. Use Case Walk Through
  10. +
  11. Potential for Standardization
  12. +
  13. Footnotes +
  14. Appendix +
      +
    1. Acknowledgments
    2. +
    3. Abbreviations
    4. +
  15. +
-
3.1.1.2 Keyboard
-

The keyboard may be optionally used to capture the text input if the IPA accepts this input modality.

+ +

+ 1. Introduction +

+

Intelligent Personal Assistants (IPAs) are now available in our daily lives through our smart phones. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more are helping us with various tasks, like shopping, playing music, setting a schedule, sending messages, and offering answers to simple questions. Additionally, we equip our households with smart speakers like Amazon’s Alexa or Google Home, which handle these sorts of tasks, or even control household appliances in our homes, without the need to pick up a dedicated device. As of today, there is no interoperability among the available IPA providers; for exchanging learned user behaviors in particular, interoperability is unlikely to happen at all.

+

Furthermore, in addition to these general-purpose assistants, + there are also specialized virtual assistants which are able to + provide their users with in-depth information which is specific + to an enterprise, government agency, school, or other + organization. They may also have the ability to perform + transactions on behalf of their users, such as purchasing items, + paying bills, or making reservations. Because of the breadth of + possibilities for these specialized assistants, it is imperative + that they be able to interoperate with the general-purpose + assistants. Without this kind of interoperability, enterprise + developers will need to re-implement their intelligent + assistants for each major generic platform.

+ +

This document is a first step in our strategy for IPA standardization. It describes a general architecture of IPAs and explores the potential areas for standardization. It focuses on voice as the major input modality. We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best practices and seek adjacencies for creation or investment. The overall concept is not restricted to voice but also covers purely text-based interactions with so-called chatbots as well as interaction using multiple modalities. Conceptually, the authors also define executing actions in the user's environment, like turning on the light, as a modality. This means that components that deal with speech recognition, natural language understanding or speech synthesis will not necessarily be available in all deployments. In the case of chatbots, speech components will be omitted. In the case of multimodal interaction, the system may be extended with components that recognize input from the respective modality and transform it into something meaningful, and, vice versa, with components that generate output in one or more modalities. Some modalities may be used as output-only, like turning on the light, while other modalities may be used as input-only, like touch.

+ +

+ 2. Problem Statement +

-

3.1.2 Presentation

-

Presentation devices or modality synthesizers are used to provide system output to the user. Additional output modalities can be employed that render their output - with a specific modality synthesizer. It is not always required that a verbal auditory output is made as a reply to a user. The user can also become aware of the output as a consequence of an observable action as a result - of a Local Service within the Client Layer or an External Services call from the External Data / Services / IPA Providers Layer. - In these cases an additional nonverbal auditory output may be considered.

- -
3.1.2.1 Speaker
-

The loudspeaker is used to output replies as verbal auditory output - in the shape of spoken utterances as a primary output modality. - Utterances may be accompanied by nonverbal auditory output such as

- - -
3.1.2.2 Display
-

The display may be optionally used to present text output if the IPA supports this output modality.

- -

3.1.3 IPA Client

-

Clients enable the user to access the IPA via voice with the following characteristics.

- - -
3.1.3.1 Client Activation Strategy
-

The Client Activation Strategy defines how the client gets activated to be ready to receive spoken commands as input. In turn the Microphone - is opened for recording. Client Activation Strategies are not exclusive but may be used concurrently. The most common activation strategies are described in the - table below

- - - - - - - - - - - - - - - - - - - - - - - - - -
Client Activation StrategyDescription
Push-to-talkThe user explicitly triggers the start of the client by means of a physical or on-screen button or its equivalent in a client application.
HotwordIn this case, the user utters a predefined word or phrase to activate the client by voice. Hotwords may also be used to preselect a known - IPA Provider. In this case the identifier of that IPA Provider is also used as additional metadata - augmenting the input - This hotword is usually not part of the spoken command that is passed for further evaluation.
Gesture-to-talkThe user triggers the start of the client by means of a gesture, e.g. raising the hand to be detected by a sensor.
Local Data ProvidersIn this case, a change in the environment may activate the client, for example if the user enters a room.
......
-

The usage of hotwords includes privacy aspects as the microphone needs to be always active. Streaming to the components outside the user's control should be avoided, hence detection of hotwords should ideally happen locally. - With regard to nested usage of IPAs that may feature their own hotwords, the detection of hotwords might be required to be extensible.

- -
3.1.3.2 Local Service Registry
-

A registry for all Local Services and Local Data Providers that can be accessed by the client -

-

- -

3.1.3 Local Services

-

Local services can be used to execute local actions in the user's local environment. Examples include turning on the light or starting an application, for instance a navigation system in a car.

- -

3.1.4 Local Data Providers

-

Local Data Providers capture input that is accessible in the user's local environment. They can be used to provide additional input to the IPA Client or - to provide additional information that is needed to execute services. An example for the latter is the state of the light, either turned on or turned off.

- -

3.2 Dialog Layer

-

The Dialog Layer contains the main components to drive the interaction with the user. The following figure details the high-level view of the Dialog Layer shown in Figure 1.

- - -

3.2.1 IPA Service

-

The general IPA Service API mediates between the user and the overall IPA system. The service layer may be omitted in case the IPA Client communicates directly with - Dialog Manager. However, this is not recommended as it may contradict the principle of separation-of-concerns. It has the following characteristics -

- -

3.2.2 ASR

-

The Automated Speech Recognizer (ASR) receives audio streams of recorded utterances and generates a recognition hypothesis as text strings for the local IPA. - Conceptually, ASR is a modality recognizer for speech. It has the following characteristics -

- -

3.2.3 NLU

-

An Natural Language Understanding (NLU) component that able to extract meaning as intents and associated entities from an utterance as text strings. -

-
Intent
-
An intent is a group of utterances with similar meaning.
-
Entity
-
An entity captures additional information to an intent.
-
- - The NLU component has the following characteristics -

- -

3.2.4 Dialog Manager

-

The Dialog Manager is a component that receives semantic information determined from user input, updates the dialog history, - its internal state, decides upon subsequent steps to continue a dialog and provides output, - mainly as synthesized or recorded utterances. Conceptually the dialog manager defines the playground that is used by the Dialogs - and contributes significantly to the user experience. - The Dialog Manager has the following characteristics -

-

- -
3.2.4.1 Dialog Strategy
-

A Dialog Strategy is a conceptualization of a dialog for an operationalization in a computer system. It defines the representation of the dialog's state and - respective operations to process and generate events relevant to the interaction. This specification is agnostic to the employed Dialog Strategy. Examples of - dialog strategy include

- - - - - - - - - - - - - - - - - - - - - - - - - -
Dialog StrategyExample
State-basedState Chart XML (SCXML): State Machine Notation for Control Abstraction
Frame-basedVoice Extensible Markup Language (VoiceXML) 2.1
Plan-basedInformation State Update
Dialog State TrackingMachine Learning for Dialog State Tracking: A Review
......
- -
3.2.4.2 Session
-

Dialog execution can be governed by sessions, - e.g. to free resources of ASR and NLU engines when a session - expires. Linguistic phenomena, like anaphoric references and - ellipsis, are expected to work within a session. Conceptually, - multiple sessions can be active in parallel on a single IPA - depending on the capabilities of the IPA. - The selected IPA Providers or the - Dialog Manager may have leading roles - for the task of session management.

-

A session begins when

- -

may continue over multiple interaction turns, i.e. an input and - output cycle, and ends

- -

This includes the possibility that a session may persist over - multiple requests.

- -

3.2.5 Context

- -

During the interaction with a user all kinds of information are collected and managed in the so-called conversation context or dialog context. - It contains all the short and long term information needed to handle a conversation and thus may exceed the concept of a session. - It also serves for context-based reasoning with the help of - the Knowledge Graph and to generate output for the output to the user NLG. It is not possible to capture - each and every aspect of what context should comprise as discussions about context are likely to end up in trying to explain the world. For the sake of this - specification it should be possible to deal with the following characteristics

- - -
3.2.5.1 History
-

The Dialog History mainly stores the past dialog events per user. Dialog events include users’ transcriptions, semantic interpretations and resulting actions. - Thus, it has information on how the user reacted in the past and knows her preferences. The history may also be used to resolve anaphoric - references in the NLU or can be used as temporary knowledge in the Knowledge Graph.

- -
3.2.5.2 Knowledge Graph
-

The system uses a knowledge graph, e.g., to reason about entities and intents. This may be received from the detected input from the - NLU or Data Providers to come up with some more meaningful data matching the current task better. - One example is the use of the name of a person as a navigation target as a person usually has an address that qualifies to be used in navigation tasks.

- -

3.2.6 NLG

-

The natural language generation (NLG) component is responsible for preparing the natural language text that represents the system’s output. - It has the following characteristics -

- -

3.2.7 TTS

-

The Text-to-Speech (TTS) component receives text strings, which it converts into audio data. Conceptually, the TTS is a modality specific renderer for speech. - It has the following characteristics -

- -

3.2.8 Dialogs

- -

Dialogs support interaction with the user. They include Core Dialogs, which are built into the system, and provide basic interactions, as well as more specialized dialogs which support additional functionality.

+

Currently, users mainly use the IPA Provider that is shipped with a certain piece of hardware. Thus, the selection of a smart phone manufacturer actually determines which IPA implementation they are using. Switching among different IPA providers also involves switching the manufacturer, which incurs high costs and requires getting used to a new user interface specific to the new manufacturer. On the one hand, users should have more freedom in selecting the IPA implementation they want; as it stands, they are bound to the services available in that implementation, which may not be what they would prefer. On the other hand, IPA providers, which mainly produce software, must also function as hardware manufacturers to be successful.

+

Moreover, we are also seeing the emergence of independent conversational agents, owned and operated by independent enterprises, and built by third-party development agencies either on white-label platforms or from best-of-breed components. This may largely free IPA development from hardware. Such a market transition creates an ever greater impetus for this work.

+

Finally, manufacturers also have to take care of porting existing services to their platform. Standardization would clearly lower the effort needed for porting and thus reduce costs. Additionally, it may also pave the way for interoperability among available IPA providers: tasks could be transferred, partially or completely, to other IPAs.

+ +

In order to explore the potential for standardization, a + typical usage scenario is described in the following section.

+ +

+ 2.1 Use Cases +

+

This section describes potential usages of IPAs.

+ +

+ 2.1.1 Travel + Planning +

+

A user would like to plan a trip to an international + conference and she needs visa information and airline + reservations. She will give the intelligent personal assistant + (IPA) her visa information (her citizenship, where she is going, + purpose of travel, etc.) and it will respond by telling her the + documentation she needs, how long the process will take and what + the cost will be. This may require the personal assistant to + consult with an auxiliary web service or another personal + assistant that knows about visas.

+ +

Once the user has found out about the visa, she tells the IPA + that she wants to make airline reservations. She specifies her + dates of travel and airline preferences and the IPA then + interacts with her to find appropriate flights.

+ +

A similar process will be repeated if the user wants to book + a hotel, find a rental car, or find out about local attractions + in the destination city. Booking a hotel as part of attending a + conference could also involve finding out about a designated + conference hotel or special conference rates, which, again, + could require interaction with the hotel or the conference's + IPA's.

+ +

+ 2.1.2 Emergency + Events +

+

Users encounter emergency situations that require them to use their hands while administering medical care, driving or operating machinery. Manual interactions on control panels, keyboards or touch pads can impede life-saving activities and diminish focus while operating sensitive vehicles, devices and machinery. Users would benefit from a secure, interoperable, voice-interactive system that can be used to access necessary information, keeping hands free to perform these actions.

+ +

Examples of emergency applications include:

+ + + +

All of these use cases benefit from voice interaction systems + that have:

+ + + +

Interoperability:

+ + + +

+ 2.2 Roles and Responsibilities +

+ +

The following roles and responsibilities, following the RACI model (responsible, accountable, consulted, informed), are identified

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
RoleRACI
Platform providerxx
Content Ownerxx
Developerxx
Designer and Application Developerx
System Integratorx
User
+ +
+
Platform provider
+
Accountable and responsible for the operative performance of the infrastructure (uptime, security, and performance as measured against service-level agreements (SLAs) with clients, customers, and partners), inclusive of on-premises hardware and cloud services.
+
Content Owner
+
+ Accountable for the UX, content, and operational performance + of any and all assistants that represent the brand and its + services to brand constituents (including clients, + customers, and internal stakeholders).
Example: a + financial services enterprise, such as a bank +
+
Developer
+
+ Responsible to the content owner for the + + Example: Most often, an independent enterprise specializing + in conversational assistance. +
+
Designer and Application Developer
+
+ Responsible to the content owner for + +
+
System Integrator
+
+ Responsible to content owner for + +
+
User
+
Uses the IPA
+
+ +

+ 3. Architecture +

+ +

+ In order to cope with such use cases as + those described above an IPA follows the general design concepts + of a voice user interface, as can be seen in Figure 1. +

+ +

+ The architecture described in this document follows the SOLID + principle introduced by Robert C. Martin to arrive at a + scalable, understandable and reusable software solution. +

+
+
Single responsibility principle
+
The components should have only one clearly-defined + responsibility.
+
Open closed principle
+
Components should be open for extension, but closed for + modification.
+
Liskov substitution principle
+
Components may be replaced without impact on the basic system behavior.
+
Interface segregation principle
+
Many specific interfaces are better than one + general-purpose interface.
+
Dependency inversion principle
+
High-level components should not depend on low-level + components. Both should depend on their interfaces.
+
+ +
Basic IPA Architecture +
Fig. 1 Basic architecture of an IPA
+

This architecture follows a traditional partitioning of conversational systems, with separate components for speech recognition, natural language understanding, dialog management, natural language generation, and audio output (audio files or text-to-speech). This architecture does not rule out combining some of these components in specific systems.

+ +

This architecture aims at serving, among others, the + following most popular high-level use cases for IPAs

+
    +
  1. Question Answering or Information Retrieval
  2. +
  3. Executing local and/or remote services to accomplish + tasks
  4. +
+

This is supported by a flexible architecture that supports dynamically adding local and remote services or knowledge sources such as data providers. Moreover, it is possible to include other IPAs, with the same architecture, and forward requests to them, similar to the principle of a Russian doll (omitting the Client Layer). All this describes the capabilities of the IPA. These extensions may be selected from a standardized marketplace. For the remainder of this document, we consider an IPA that is extendable via such a marketplace.

+ +

Not all components may be needed for actual implementations, + some may be omitted completely. However, we note them here to + provide a more complete picture. This architecture comprises + three layers that are detailed in the following sections

+
    +
  1. Client Layer
  2. +
  3. Dialog Layer
  4. +
  5. External Data / Services / IPA + Providers
  6. +
+

Actual implementations may want to distinguish more than these layers. The assignment to the layers is not considered to be strict, so some of the components may be shifted to other layers as needed. This view reflects what the Community Group regards as ideal and shows the intended separation of concerns.

+ + +

+ 3.1 Client Layer +

+

The Client Layer contains the main components that interface + with the user. The following figure details the view onto the + Client Layer shown in Figure 1.

+ + +

+ 3.1.1 Capture +

+ +

+ Capture devices or modality recognizers are used to capture multimodal user input, such as voice or text input. Additional input modalities can be employed that capture input with specific modality recognizers. Additional input may be gathered from Local Data Providers.

+ +
+ 3.1.1.1 Microphone +
+

The microphone is used to capture the voice input of a user + as a primary input modality.

+ +
+ 3.1.1.2 Keyboard +
+

The keyboard may be optionally used to capture the text input + if the IPA accepts this input modality.

+ +

+ 3.1.2 Presentation +

+

+ Presentation devices or modality synthesizers are used to + provide system output to the user. Additional output modalities + can be employed that render their output with a specific + modality synthesizer. It is not always required that a verbal + auditory output is made as a reply to a user. The user can also + become aware of the output as a consequence of an observable + action as a result of a Local + Service within the Client Layer + or an External Services call + from the External Data / Services / IPA + Providers Layer. In these cases an additional nonverbal + auditory output may be considered. +

+ +
+ 3.1.2.1 Speaker +
+

The loudspeaker is used to output replies as verbal auditory + output in the shape of spoken utterances as a primary output + modality. Utterances may be accompanied by nonverbal auditory + output such as

+ + +
+ 3.1.2.2 Display +
+

The display may be optionally used to present text output if + the IPA supports this output modality.

+ +

+ 3.1.3 IPA Client +

+

Clients enable the user to access the IPA via voice with the + following characteristics.

+ + +
+ 3.1.3.1 Client Activation Strategy +
+

+ The Client Activation Strategy defines how the client gets + activated to be ready to receive spoken commands as input. In + turn the Microphone is opened for + recording. Client Activation Strategies are not exclusive but + may be used concurrently. The most common activation strategies + are described in the table below +

+ + + + + + + + + + + + + + + + + + + + + + + + + +
Client Activation StrategyDescription
Push-to-talkThe user explicitly triggers the start of the + client by means of a physical or on-screen button or its + equivalent in a client application.
HotwordIn this case, the user utters a predefined word or phrase to activate the client by voice. Hotwords may also be used to preselect a known IPA Provider; in this case the identifier of that IPA Provider is also used as additional metadata augmenting the input. The hotword is usually not part of the spoken command that is passed on for further evaluation.
Gesture-to-talkThe user triggers the start of the client by means + of a gesture, e.g. raising the hand to be detected by a + sensor.
Local Data + ProvidersIn this case, a change in the environment may + activate the client, for example if the user enters a + room.
......
+

The usage of hotwords raises privacy concerns, as the microphone needs to be always active. Streaming to components outside the user's control should be avoided; hence, detection of hotwords should ideally happen locally. With regard to nested usage of IPAs that may feature their own hotwords, hotword detection might need to be extensible.
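A minimal sketch of how concurrent activation strategies could be modeled behind a common interface is shown below; the interface and class names are illustrative assumptions of this sketch, not part of the specification.

```typescript
// Non-normative sketch: Client Activation Strategies behind a common
// interface. All names here are illustrative assumptions.

interface ActivationEvent {
  strategy: "push-to-talk" | "hotword" | "gesture-to-talk" | "local-data-provider";
  providerId?: string; // optional IPA Provider preselected by a hotword
}

interface ClientActivationStrategy {
  // Resolves once this strategy decides that the client should be activated.
  waitForActivation(): Promise<ActivationEvent>;
}

class PushToTalkStrategy implements ClientActivationStrategy {
  constructor(private button: { onPress(callback: () => void): void }) {}
  waitForActivation(): Promise<ActivationEvent> {
    return new Promise((resolve) =>
      this.button.onPress(() => resolve({ strategy: "push-to-talk" }))
    );
  }
}

// Strategies are not exclusive: the first one that fires activates the
// client, which in turn opens the Microphone for recording.
function awaitActivation(strategies: ClientActivationStrategy[]): Promise<ActivationEvent> {
  return Promise.race(strategies.map((s) => s.waitForActivation()));
}
```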

+ +
+ 3.1.3.2 Local Service Registry +
+

+ A registry for all Local Services + and Local Data Providers that + can be accessed by the client +

+

+ +

+ 3.1.4 Local Services +

+

Local services can be used to execute local actions in the + user's local environment. Examples include turning on the light + or starting an application, for instance a navigation system in + a car.

+ +

+ 3.1.5 Local Data Providers +

+

+ Local Data Providers capture input that is accessible in the + user's local environment. They can be used to provide additional + input to the IPA Client or to provide + additional information that is needed to execute services. An + example for the latter is the state of the light, either turned + on or turned off. +

+ +

+ 3.2 Dialog Layer +

+

The Dialog Layer contains the main components to drive the + interaction with the user. The following figure details the + high-level view of the Dialog Layer shown in Figure 1.

+ + +

+ 3.2.1 IPA Service +

+

+ The general IPA Service API mediates between the user and the overall IPA system. The service layer may be omitted in case the IPA Client communicates directly with the Dialog Manager. However, this is not recommended as it may contradict the principle of separation of concerns. It has the following characteristics

+

+ +

+ 3.2.2 ASR +

+

The Automated Speech Recognizer (ASR) receives audio streams + of recorded utterances and generates a recognition hypothesis as + text strings for the local IPA. Conceptually, ASR is a modality + recognizer for speech. It has the following characteristics +

+

+ +

+ 3.2.3 NLU +

+

A Natural Language Understanding (NLU) component that is able to extract meaning as intents and associated entities from an utterance provided as text strings.

+
Intent
+
An intent is a group of utterances with similar + meaning.
+
Entity
+
An entity captures additional information to an intent.
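As a non-normative illustration of these two terms, one possible representation of an intent with its entities is sketched below; the field names and the example utterance are assumptions and are not prescribed by this document.

```typescript
// Non-normative sketch of one possible intent/entity representation.
interface Entity {
  name: string;  // e.g. "destination"
  value: string; // e.g. "Tokyo"
}

interface Intent {
  name: string;       // e.g. "book-flight"
  confidence: number; // confidence reported by the NLU
  entities: Entity[];
}

// "I want to book a flight to Tokyo on Friday" might be interpreted as:
const interpretation: Intent = {
  name: "book-flight",
  confidence: 0.92,
  entities: [
    { name: "destination", value: "Tokyo" },
    { name: "departure-date", value: "Friday" },
  ],
};
```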
+
+ + The NLU component has the following characteristics + +

+ +

+ 3.2.4 Dialog Manager +

+

+ The Dialog Manager is a component that receives semantic information determined from user input, updates the dialog history and its internal state, decides upon subsequent steps to continue a dialog, and provides output, mainly as synthesized or recorded utterances. Conceptually, the dialog manager provides the framework in which the Dialogs operate and contributes significantly to the user experience. The Dialog Manager has the following characteristics

+

+ +
+ 3.2.4.1 Dialog Strategy +
+

A Dialog Strategy is a conceptualization of a dialog for an + operationalization in a computer system. It defines the + representation of the dialog's state and respective operations + to process and generate events relevant to the interaction. This + specification is agnostic to the employed Dialog Strategy. + Examples of dialog strategy include

+ + + + + + + + + + + + + + + + + + + + + + + + + +
Dialog StrategyExample
State-basedState + Chart XML (SCXML): State Machine Notation for + Control Abstraction
Frame-basedVoice + Extensible Markup Language (VoiceXML) 2.1
Plan-basedInformation + State Update
Dialog State TrackingMachine + Learning for Dialog State Tracking: A Review
......
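As a toy, non-normative illustration of the first row of this table, a state-based strategy can be realized as a small state machine in which intents returned by the NLU drive the transitions; the states and intent names below are invented for a flight reservation flow, and a real system might express the same behavior in SCXML instead.

```typescript
// Toy sketch of a state-based Dialog Strategy: the dialog state is an
// explicit state machine and NLU intents drive the transitions.

type DialogState = "collect-destination" | "collect-date" | "confirm" | "done";

const transitions: Record<DialogState, (intent: string) => DialogState> = {
  "collect-destination": (i) =>
    i === "provide-destination" ? "collect-date" : "collect-destination",
  "collect-date": (i) => (i === "provide-date" ? "confirm" : "collect-date"),
  "confirm": (i) => (i === "confirm-booking" ? "done" : "collect-destination"),
  "done": () => "done",
};

function nextState(state: DialogState, intent: string): DialogState {
  return transitions[state](intent);
}
```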
+ +
+ 3.2.4.2 Session +
+

+ Dialog execution can be governed by sessions, e.g. to free + resources of ASR and NLU engines when a session expires. + Linguistic phenomena, like anaphoric references and ellipsis, + are expected to work within a session. Conceptually, multiple + sessions can be active in parallel on a single IPA depending on + the capabilities of the IPA. The selected IPA + Providers or the Dialog Manager + may have leading roles for the task of session management. +

+

A session begins when

+ +

may continue over multiple interaction turns, i.e. an input + and output cycle, and ends

+ +

This includes the possibility that a session may persist over + multiple requests.
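A minimal sketch of timeout-based session handling is given below; it assumes an expiry policy based on the time of the last interaction turn, while actual session management may be led by the Dialog Manager or by the selected IPA Providers, as noted above.

```typescript
// Non-normative sketch of session handling with expiry.

interface Session {
  id: string;
  lastActivity: number; // timestamp of the most recent interaction turn
}

class SessionManager {
  private sessions = new Map<string, Session>();
  constructor(private timeoutMs: number) {}

  // Resumes an existing, unexpired session or begins a new one.
  beginOrResume(id?: string): Session {
    const now = Date.now();
    const existing = id ? this.sessions.get(id) : undefined;
    if (existing && now - existing.lastActivity < this.timeoutMs) {
      existing.lastActivity = now; // the session persists over multiple requests
      return existing;
    }
    const created: Session = { id: Math.random().toString(36).slice(2), lastActivity: now };
    this.sessions.set(created.id, created);
    return created;
  }

  // Expired sessions are removed, e.g. to free ASR and NLU resources.
  expire(): void {
    const now = Date.now();
    for (const [id, session] of this.sessions) {
      if (now - session.lastActivity >= this.timeoutMs) this.sessions.delete(id);
    }
  }
}
```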

+ +

+ 3.2.5 Context +

+ +

+ During the interaction with a user, all kinds of information are collected and managed in the so-called conversation context or dialog context. It contains all the short- and long-term information needed to handle a conversation and thus may exceed the concept of a session. It also supports context-based reasoning with the help of the Knowledge Graph and the generation of output to the user via the NLG. It is not possible to capture each and every aspect of what context should comprise, as discussions about context are likely to end up trying to explain the world. For the sake of this specification it should be possible to deal with the following characteristics

+ + +
+ 3.2.5.1 History +
+

+ The Dialog History mainly stores the past dialog events per + user. Dialog events include users’ transcriptions, semantic + interpretations and resulting actions. Thus, it has information + on how the user reacted in the past and knows her preferences. + The history may also be used to resolve anaphoric references in + the NLU or can be used as temporary knowledge + in the Knowledge Graph. +

+ +
+ 3.2.5.2 Knowledge Graph +
+

+ The system uses a knowledge graph, e.g., to reason about entities and intents. Input detected by the NLU or obtained from Data Providers may be enriched via the knowledge graph to arrive at more meaningful data that better matches the current task. One example is the use of the name of a person as a navigation target, as a person usually has an address that qualifies to be used in navigation tasks.
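The person-to-address example can be sketched as follows; the lookup API shown is an assumption made for illustration only.

```typescript
// Non-normative sketch: a knowledge graph lookup refines an entity so that
// it better matches the current task.

interface KnowledgeGraph {
  // e.g. lookup("Alice", "address") -> "22 Main Street, Springfield"
  lookup(subject: string, relation: string): string | undefined;
}

function resolveNavigationTarget(target: string, graph: KnowledgeGraph): string {
  // "Navigate to Alice" becomes a navigation to Alice's address, if known.
  return graph.lookup(target, "address") ?? target;
}
```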

+ +

+ 3.2.6 NLG +

+

The natural language generation (NLG) component is + responsible for preparing the natural language text that + represents the system’s output. It has the following + characteristics +

+

+ +

+ 3.2.7 TTS +

+

The Text-to-Speech (TTS) component receives text strings, + which it converts into audio data. Conceptually, the TTS is a + modality specific renderer for speech. It has the following + characteristics +

+

+ +

+ 3.2.8 Dialogs +

+ +

Dialogs support interaction with the user. They include Core Dialogs, which are built into the system and provide basic interactions, as well as more specialized dialogs which support additional functionality.

3.2.8.1 Core Dialog @@ -699,219 +1224,529 @@
as described in the following section that is always available.

-
3.2.8.2 Dialog
-

A Dialog is able to handle functionality that can be added to the capabilities of the Dialog Manager through its associated Intent Sets. - Dialogs are logical entities within the overall description of the interaction with the user, executed by the Dialog Manager. - - Dialogs must serve different purposes in the sense that they are unique for a certain task. E.g., only a single flight reservation dialog may exist at a time. - Dialogs have the following characteristics -

-

- -
3.2.8.3 Core Intent Sets
-

A Core Intent Set usually identifies tasks to be executed and defines the capabilities of the Core Dialog. - Conceptually, the Core Intent Sets are Intent Sets that are always available.

- -
3.2.8.4 Intent Sets
-

Intent Sets define actions, identified by the name of the intent, along with their parameters as entities as it is produced by the NLU that can be consumed by a corresponding - Dialog and have the following characteristics -

-

- -
3.2.8.5 Dialog X
-

The Dialog X's are able to handle functionality that can be added to the capabilities of the Dialog Manager through their associated Intent Set X. A Dialog X extends the - Core Dialogs and add functionality by custom Dialogs. The Dialog X's must server different purposes - in a sense that they are unique for a certain task. E.g., only a single flight reservation dialog may exist at a time. They have the same characteristics as a Dialog.

- -
3.2.8.6 Intent Set X
-

An Intent Set X is a special Intent Set that identifies tasks that can be executed within the associated Dialog X.

- -
3.2.8.7 Dialog Registry
-

The Dialog Registry manages all available Dialogs with their associated Intent Sets with respect to the current Dialog Strategy. - This means, it is the Dialog Registry that would know which Dialog to use for a given intent. - For some Dialog Strategy this component may be omitted as it is taken over by the Dialog Manager. - One of these cases is when the Dialog Strategies does not allow for the dynamic handling of Dialogs as described below. -

-

- -

3.3 External Data / Services / IPA Providers Layer

- - -

3.3.1 Provider Selection Service

- -

A service that provides access to all known Data Providers, External Services and IPA Providers. This service also maps the IPA Intent Sets to the Intent Sets in the Dialog layer. - It has the following characteristics -

-

- -
3.3.1.1 Provider Selection Strategy
-

The Provider Selection Strategy aims at determining those IPA Providers that are most likely suited to handle the current input. - Generally,the system should not make any assumptions about the user's current input as she may switch goals with each input but there may be some deviating use cases. - The provider selection strategy may be implemented for example as one of the following options or a combination thereof to determine a list of IPA Providers candidates. -

- In case the IPA Provider does not abstract from determining a relevant list of intents, the same strategy may be applied to determine the n-best intents. -

- -
3.3.1.2 Provider Registry
-

A registry for all IPA Providers that can be accessed. It has the following characteristics -

-

- -
3.3.1.3 Accounts/Authentication
-

A registry that knows how to access the known IPA Providers, i.e., which are available and credentials to access them. Storing of credentials must meet security and trust considerations that are - expected from such a personalized service. It has the following characteristics -

-

- -
3.3.2 External Service Registry
-

A registry for all External Services and Data Providers that can be accessed by the client -

-

- -

3.3.3 Data Providers

- -

Data Providers obtain data from various external sources for use in the interaction, for example, data obtained from a third-party web service.

- -
3.3.3.1 Data Provider X
-

A data provider to get data to be used in the Dialog, e.g. as a result of a query.

- -

3.3.4 External Services

- -

External Services provide access to trigger actions outside of the system; for example, triggered from a third-party web service.

- -
3.3.4.1 External Service X
-

A specific External Service, which provides output of the system, e.g. through an application can use multiple External Services.

- -

3.3.5 IPA Providers

- -

IPA providers provide IPA's that can interact with users in an application.

+
+ 3.2.8.2 Dialog +
+

+ A Dialog is able to handle functionality that can be added to + the capabilities of the Dialog + Manager through its associated Intent Sets. Dialogs are + logical entities within the overall description of the + interaction with the user, executed by the Dialog Manager. Dialogs must serve + different purposes in the sense that they are unique for a + certain task. E.g., only a single flight reservation dialog may + exist at a time. Dialogs have the following characteristics +

+

+ +
+ 3.2.8.3 Core Intent Sets +
+

+ A Core Intent Set usually identifies tasks to be executed and + defines the capabilities of the Core + Dialog. Conceptually, the Core Intent Sets are Intent Sets + that are always available. +

+ +
+ 3.2.8.4 Intent Sets +
+

+ Intent Sets define actions, identified by the name of the intent, along with their parameters as entities, as produced by the NLU, that can be consumed by a corresponding Dialog. They have the following characteristics

+

+ +
+ 3.2.8.5 Dialog X +
+

+ The Dialog X's are able to handle functionality that can be added to the capabilities of the Dialog Manager through their associated Intent Set X. A Dialog X extends the Core Dialogs and adds functionality through custom Dialogs. The Dialog X's must serve different purposes in the sense that they are unique for a certain task. For example, only a single flight reservation dialog may exist at a time. They have the same characteristics as a Dialog.

+ +
+ 3.2.8.6 Intent Set X +
+

+ An Intent Set X is a special Intent + Set that identifies tasks that can be executed within the + associated Dialog X. +

+ +
+ 3.2.8.7 Dialog Registry +
+

+ The Dialog Registry manages all available Dialogs with their associated Intent Sets with respect to the current Dialog Strategy. This means it is the Dialog Registry that knows which Dialog to use for a given intent. For some Dialog Strategies this component may be omitted, as its role is taken over by the Dialog Manager. One of these cases is when the Dialog Strategy does not allow for dynamic handling of Dialogs as described below.
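A minimal sketch of such a registry is shown below, assuming that intent names taken from each Dialog's Intent Set are mapped to the Dialog that handles them; the interface is illustrative and not part of this specification.

```typescript
// Non-normative sketch of a Dialog Registry.

interface RegisteredDialog {
  id: string;          // e.g. "flight-reservation"
  intentSet: string[]; // intents this Dialog can consume
  handle(intentName: string, entities: Record<string, string>): Promise<void>;
}

class DialogRegistry {
  private byIntent = new Map<string, RegisteredDialog>();

  register(dialog: RegisteredDialog): void {
    // Dialogs are unique for a certain task, so each intent maps to one Dialog.
    for (const intent of dialog.intentSet) {
      this.byIntent.set(intent, dialog);
    }
  }

  // The Dialog Manager asks the registry which Dialog to use for an intent.
  find(intentName: string): RegisteredDialog | undefined {
    return this.byIntent.get(intentName);
  }
}
```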

+

+ +

+ 3.3 External Data / Services / IPA + Providers Layer +

+ + +

+ 3.3.1 Provider Selection Service +

+ +

A service that provides access to all known Data Providers, + External Services and IPA Providers. This service also maps the + IPA Intent Sets to the Intent Sets in the Dialog layer. It has + the following characteristics +

+

+ +
+ 3.3.1.1 Provider Selection Strategy +
+

+ The Provider Selection Strategy aims at determining those IPA Providers that are most likely suited to handle the current input. Generally, the system should not make any assumptions about the user's current input, as she may switch goals with each input, but there may be some deviating use cases. The provider selection strategy may be implemented, for example, as one of the following options or a combination thereof to determine a list of IPA Provider candidates.

+ In case the + IPA Provider does not abstract from + determining a relevant list of intents, the same strategy may be + applied to determine the n-best intents. +
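One non-normative way to realize such a strategy is to combine several scoring heuristics into a ranked list of candidates, as sketched below; the heuristics and the simple additive combination are assumptions of this sketch.

```typescript
// Non-normative sketch of a Provider Selection Strategy producing an n-best
// list of IPA Provider candidates.

interface ProviderCandidate {
  providerId: string;
  score: number;
}

type ScoringHeuristic = (
  input: { text: string; preselectedProvider?: string },
  providerId: string
) => number;

function selectProviders(
  input: { text: string; preselectedProvider?: string },
  providerIds: string[],
  heuristics: ScoringHeuristic[],
  nBest: number
): ProviderCandidate[] {
  return providerIds
    .map((providerId) => ({
      providerId,
      score: heuristics.reduce((sum, heuristic) => sum + heuristic(input, providerId), 0),
    }))
    .sort((a, b) => b.score - a.score) // highest combined score first
    .slice(0, nBest);
}
```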

+ +
+ 3.3.1.2 Provider Registry +
+

A registry for all IPA Providers that can be accessed. It has + the following characteristics +

+

+ +
+ 3.3.1.3 Accounts/Authentication +
+

A registry that knows how to access the known IPA Providers, i.e., which ones are available and the credentials needed to access them. Storage of credentials must meet the security and trust considerations that are expected from such a personalized service. It has the following characteristics

+

+ +
+ 3.3.2 External Service Registry +
+

+ A registry for all External + Services and Data Providers + that can be accessed by the client +

+

+ +

+ 3.3.3 Data Providers +

+ +

Data Providers obtain data from various external sources for + use in the interaction, for example, data obtained from a + third-party web service.

+ +
+ 3.3.3.1 Data Provider X +
+

+ A data provider to get data to be used in the Dialog, + e.g. as a result of a query. +

+ +

+ 3.3.4 External Services +

+ +

External Services provide access to trigger actions outside + of the system; for example, triggered from a third-party web + service.

+ +
+ 3.3.4.1 External Service X +
+

A specific External Service that provides output of the system; an application, for example, can use multiple External Services.

+ +

+ 3.3.5 IPA Providers +

+ +

IPA providers provide IPAs that can interact with users in an application.

+ +

+ In this sense an IPA Provider might itself be a fully fledged IPA, with the exception of the Client Layer, as the calling IPA takes over the role of a client to the nested IPA. This can be perceived as the Matryoshka (or Russian Doll) principle1. Each IPA may be used perfectly well as is, but can also be approached by other IPAs.
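A minimal sketch of this principle is given below, assuming that an IPA Provider exposes the same request/response contract as the IPA itself (minus the Client Layer); the interface shown is illustrative only.

```typescript
// Non-normative sketch of the Matryoshka principle: one IPA acts as a client
// of a nested IPA that implements the same interface.

interface IPARequest {
  text: string;
  sessionId?: string;
}

interface IPAResponse {
  text: string;
  sessionId: string;
}

interface IPAProvider {
  process(request: IPARequest): Promise<IPAResponse>;
}

// An IPA that cannot resolve a request locally may simply forward it to a
// nested provider implementing the same interface.
class DelegatingIPA implements IPAProvider {
  constructor(private nested: IPAProvider) {}
  process(request: IPARequest): Promise<IPAResponse> {
    return this.nested.process(request);
  }
}
```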

+ + +
+ 3.3.5.1 IPA Provider X +
+

A provider of an IPA service, like +

+

+

The IPA provider may be part of the IPA implementation as an IPA Provider or, alternatively, a subset of the original functionality, as described below, that is part of another IPA implementation.

+ +
+ 3.3.5.2 Provider ASR +
+

+ An ASR component receives audio streams of recorded utterances + and generates a recognition hypothesis as text strings as an + input for the Provider NLU. +

+ +
+ 3.3.5.3 Provider NLU +
+

+ An NLU component that is able to extract meaning as intents and + associated entities from an utterance as text strings for IPA Provider X. It has the following + characteristics +

+

+ +
+ 3.3.5.4 Provider Intent Set +
+

+ An Intent Set that might be returned + by the Provider NLU to handle the + capabilities of IPA Provider X. +

+ +

+ 3.6 Resulting Architecture +

+

The previous sections provided a more detailed view of the architectural building blocks. A general overview comprising these details is shown in the following figure.

+ +
IPA Architecture
Fig. + 2 Complete architecture of an IPA

4. Error Handling @@ -944,313 +1779,474 @@

  • derive a new higher-level error from the received errors and forward this higher-level error
  • -

    In case errors could be handled it is recommended to log the errors for - debugging.

    +

In case errors could be handled, it is recommended to log the errors for debugging.
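A minimal sketch of deriving a higher-level error from received errors, forwarding it, and logging it for debugging is shown below; the fields are assumptions of the sketch, and the minimum content of an error message is listed next.

```typescript
// Non-normative sketch of error derivation, forwarding and logging.

interface IPAError {
  source: string;      // component that raised or derived the error
  code: string;        // machine-readable identifier
  message: string;     // human-readable description
  causes?: IPAError[]; // lower-level errors this error was derived from
}

function deriveHigherLevelError(source: string, causes: IPAError[]): IPAError {
  const derived: IPAError = {
    source,
    code: "downstream-failure",
    message: `${causes.length} downstream error(s) while processing the request`,
    causes,
  };
  console.error("forwarding derived error", derived); // log for debugging
  return derived;
}
```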

    An error message should contain at least

    -

    +

    + 5. Use Case Walk Through +

    +

This section needs to be updated to match the changes as introduced above.

    + +

    This section expands on the use case above, filling in + details according to the sample architecture.

    +

    A user would like to plan a trip to an international + conference and she needs visa information and airline + reservations.

    + +

    + The user starts by asking a general purpose assistant (IPA Client, on the left of the diagram) + about what the visa requirements are for her situation. For a + common situation, such as citizens of the EU traveling to the + United States, the IPA is able to answer the question directly + from one of its dialogs 1-n getting the + information from a web service that it knows about via the + corresponding Data Provider. + However, for less common situations (for example, a citizen of + South Africa traveling to Japan), the generic IPA will try to + identify a visa expert assistant application from the dialog registry. If it finds one, + it will connect the user with the visa expert, one of the IPA providers on the right side. The + visa expert will then engage in a dialog with the user to find + out the dates and purposes of travel and will inform the user of + the visa process. +

    + +

    Once the user has found out about the visa, she tells the IPA + that she wants to make airline reservations. If she wants to use + a particular service, or use a particular airline, she would say + something like "I want to book a flight on American". The IPA + will then either connect the user with American's IPA or, if + American doesn't have an IPA, will inform the user of that fact. + On the other hand, if the user doesn't specify an airline, the + IPA will find a general flight search IPA from its registry and + connect the user with the IPA for that flight search service. + The flight search IPA will then interact with the user to find + appropriate flights.

    + +

    A similar process would be repeated if the user wants to book + a hotel, find a rental car, find out about local attractions in + the destination city, etc. Booking a hotel could also involve + interacting with the conference's IPA to find out about a + designated conference hotel or special rates.

    + +

    + 5.1 Detailed Walkthrough +

    +

This section provides a detailed walkthrough that aligns the steps in the use case interaction with the architecture. It covers only the part of the example above in which the user asks for a flight with a specific airline. This very basic example assumes that this is the first request to the IPA and that there is a suitable dialog ready that matches the user's request. The flow may also vary, e.g., depending on the Dialog Strategy used and other optional items that may result in different flows. The walkthrough is split into two parts, one for the input path and one for the output path.

    + 5.1.1 Walkthrough for the Input Path +

    + + We begin with the case where the user's request can be handled by + one of the internal Dialogs in the Dialog box. The input side is + illustrated in the following figure +
    IPA Architecture Walkthrough for the input
Fig. 3 Walkthrough for the input path of an IPA
    +
      +
1. The user asks the IPA Client about travel between the EU and the United States. The IPA Client captures the audio with the help of the microphone.
2. Requests are usually augmented by other data. The GPS location is one example that could be useful. Therefore the IPA Client asks the Local Data Provider for GPS for the current location...
3. ...and gets it back. In this case the GPS coordinates from Mountain View, California.
4. The audio is sent along with all augmenting data to the IPA Service.
5. The IPA Service forwards the received data simultaneously to the ASR in the local path and to the Provider Selection Service in the remote path.
6. The decoded text of the user's request, in this example "I want to book a flight on American", is forwarded with all augmenting data in parallel to the NLU component for the local path and to the Provider Selection Service for the remote path.
7. In the local path the NLU tries to determine intents and entities from the decoded text. For our example this may be the intent plan-flight-travel with the entity airline: American. The NLU component makes use of the Context to check if there is complementary information that might have been established throughout the interaction with the user, such as preferred times for departure or arrival.
8. There was no information to add from the History, but the GPS information could be mapped with the help of the Knowledge Graph to origin: SFO, so the local input path is completed with this step with the result plan-flight-travel with entities airline: American and origin: SFO (a JSON sketch of this result is given after this list).
9. The remote path starts with the Provider Selection Service asking the Provider Registry for suitable IPA Providers for the incoming request.
10. The Provider Registry filters the suitable IPA Providers and asks for credentials at the Accounts/Authentication component. For the example, these may be those supporting English; at this level, only the pure text and the used language are known. Further knowledge about the user may be helpful to reduce these candidates.
11. The Provider Registry receives the credentials for the IPA Provider candidates.
12. The Provider Selection Service receives the list of IPA Providers, along with their credentials, if any, back.
13. The Provider Selection Service forwards the text "I want to book a flight on American" from the utterance and the GPS coordinates for Mountain View to the received list of IPA Providers in parallel to determine meaning, which completes the remote input path.
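The following is a minimal, purely illustrative JSON sketch of the completed local input path result mentioned in the last steps; the field names are assumptions of this sketch and are not defined by this document:

    {
      "source": "Local NLU",
      "intent": "plan-flight-travel",
      "entities": {
        "airline": "American",
        "origin": "SFO"
      }
    }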
    +

    + 5.1.2 Walkthrough for the Output Path +

The output path begins where the local NLU and the IPA Providers are able to deliver their results. In both paths the best match for the intents and entities based on the received data has been identified. This path is illustrated in the following figure
    IPA Architecture Walkthrough for the output
    Fig. + 4 Walkthrough for the output path of an IPA
    + +
      +
1. The IPA Providers send their determined intents along with recognized entities to the Provider Selection Service; for our example this may be a list of provider results like the one sketched after this list. Note that the reply also contains an identification of the provider for each result. This allows pre-selection of a provider in possible follow-up dialog turns.
2. The Provider Selection Service maps the custom intents and entities to the core intents and entities that can be understood in the dialogs. It then sends this mapped result to the Dialog Manager as an n-best list.
3. On the local path the NLU sends its result to the Dialog Manager, for our example plan-flight-travel with the entities airline: American and origin: SFO.
4. The Dialog Manager determines an n-best list of meanings from the local and remote path and selects the best suited reply. For our example, it may remove the results from IPA Provider 2 and IPA Provider 3 as the confidence for the entity is very low, and it updates the History with the determined dialog move from the user. Results from IPA Provider 1 and the Local NLU have the same result; however, due to the employed rules, IPA Provider 1 is selected, as cloud-based providers are expected to have better accuracy than local engines because of the constraints of the embedded environment.
5. The Dialog Manager then sends the intent plan-flight-travel to the Dialog Registry to determine the corresponding dialog...
6. ...and receives the dialog to use back. For the example this may be the plan-flight-travel-dialog.
7. The Dialog Manager calls the plan-flight-travel dialog and fills all known entities. In our example, the slots for airline and origin would be filled.
8. The Dialog determines the next dialog step and indicates the request for a system move to query the user for the missing data.
9. The History is updated with this dialog move...
10. ...and forwarded to the NLG to create a response.
11. The NLG makes use of the Context to check output preferences and already established knowledge between the user and the system that might be used in the reply...
12. ...and receives the info back to come up with the question "Do you want to fly from San Francisco with American?".
13. The NLG forwards the text string "Do you want to fly from San Francisco with American?" to the TTS to be converted into audio.
14. The TTS engine sends the audio file for the response to the IPA Client to be made audible...
15. ...in the Speaker.
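A purely illustrative JSON sketch of the n-best list the Dialog Manager might receive in this example, consistent with the selection described above; the field names and confidence values are assumptions of this sketch:

    [
      { "source": "IPA Provider 1", "intent": "plan-flight-travel",
        "entities": { "airline": "American" }, "entityConfidence": 0.94 },
      { "source": "Local NLU", "intent": "plan-flight-travel",
        "entities": { "airline": "American", "origin": "SFO" }, "entityConfidence": 0.91 },
      { "source": "IPA Provider 2", "intent": "plan-flight-travel",
        "entities": { "airline": "American" }, "entityConfidence": 0.12 },
      { "source": "IPA Provider 3", "intent": "plan-flight-travel",
        "entities": { "airline": "American" }, "entityConfidence": 0.08 }
    ]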
    +

    + +

    + 7. Potential for Standardization +

    + +

    The general architecture of IPAs described in this document + should be detailed in subsequent documents. Further work must be + done to

      -
1. specify the interfaces among the components
2. suggest new standards where they are missing
3. refer to existing standards where applicable
4. refer to existing standards as a starting point to be refined for the IPA case
    -

    - + Currently, the authors see the following situation at the time of + writing + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Component | Potentially related standards
IPA Client |
IPA Service | none
Dialog Manager |
TTS |
ASR |
Core Dialog |
Core Intent Set | none
Dialog Registry |
Provider Selection Service | none
Accounts/Authentication |
NLU |
Knowledge Graph |
Data Provider | none
    +

    +

    + The table above is not meant to be exhaustive nor does it claim + that the identified standards are suited for IPA + implementations. They must be analyzed in more detail in + subsequent work. The majority are starting points for further + refinement. For instance, the authors consider it unlikely that + VoiceXML will + actually be used in IPA implementations. +

    +

Out of scope of a possible standardization is the implementation inside the IPA Providers and potential interoperability among them. However, standardization eases the integration of their exposed services or even allows services to be used across different providers. Actual IPA providers may make use of any upcoming standard to enhance their deployments as a marketplace of intelligent services.

    + +

8. Footnotes

    + + 1. The Russian Doll principle is a recursion + technique that is used in computer science, mathematics, logic, + grammar, and art. It is a problem-solving strategy for dealing + with complexity, where the same control structure always occurs + on multiple, infinitely nested levels. The principle is + illustrated in the form of Russian dolls (matryoshkas) that are + nested such that the same homomorphic structure appears on each + level. Summarized from Pfiffner, M. (2022). Russian Dolls. In: + The Neurology of Business. Management for Professionals. + Springer, Cham. https://doi.org/10.1007/978-3-031-14260-4_5. + + +

9. Appendix

    + +

9.1 Acknowledgements

    + +

    + This version of the document was written with the participation + of members of the W3C + Voice Interaction Community Group. The work of the following + members has significantly facilitated the development of this + document: +

    +

    + +

9.2 Abbreviations

    + + + + + + + + + + + + + + + + + + + + + + +
Abbreviation | Description
ASR | Automated Speech Recognition
NLG | Natural Language Generation
NLU | Natural Language Understanding
TTS | Text to Speech
    + diff --git a/voice interaction drafts/paInterfaces/Major-Components-Interaction.svg b/voice interaction drafts/paInterfaces/Major-Components-Interaction.svg index 255ded5..c5f82af 100644 --- a/voice interaction drafts/paInterfaces/Major-Components-Interaction.svg +++ b/voice interaction drafts/paInterfaces/Major-Components-Interaction.svg @@ -1 +1 @@ -UserClientDialogExternal Data / Services / IPAProvidersopt service callopt external IPAspar [remote][local]generateClientResponse(): ClientResponsespeak(Audio): AudioprocessDialogInput(SemanticInterpretation,ExternalClientResponse)deriveSemanticInterpretation()processInput(ClientRequest): ExternalClientResponseprocessInput(ClientRequest): ClientResponsecallService(ServieParameters): ServiceResult \ No newline at end of file +UserClientDialogExternal Data / Services / IPAProvidersopt service callopt external IPAspar [remote][local]query(Audio): AudioprocessInput(ClientRequest): ExternalClientResponseprocessInput(ClientRequest): ClientResponseprocessDialogInput(LocalResponse,ExternalClientResponse)generateClientResponse(): ClientResponseprocessInput(ClientRequest): LocalResponsecallService(ServiceParameters): ServiceResult \ No newline at end of file diff --git a/voice interaction drafts/paInterfaces/paInterfaces.htm b/voice interaction drafts/paInterfaces/paInterfaces.htm index bb124aa..2506375 100644 --- a/voice interaction drafts/paInterfaces/paInterfaces.htm +++ b/voice interaction drafts/paInterfaces/paInterfaces.htm @@ -1,55 +1,81 @@ - + - Intelligent Personal Assistant Interfaces - - +Intelligent Personal Assistant Interfaces + +
    -

    - W3C

    +

    + W3C +

    -

    Intelligent Personal Assistant Architecture

    -

    Intelligent Personal Assistant Interfaces

    +


    Latest version
    -
    +
Last modified: April 03, 2024 https://github.com/w3c/voiceinteraction/blob/master/voice%20interaction%20drafts/paInterfaces/paInterfaces.htm (GitHub repository)
HTML rendered version
    Editor
    -
    Dirk Schnelle-Walka
    - Deborah Dahl, Conversational Technologies
    +
    + Dirk Schnelle-Walka
    Deborah Dahl, Conversational + Technologies +
    - -
    + +

    Abstract

    -


    +

    + This document details the general architecture of Intelligent + Personal Assistants as described in Architecture + and Potential for Standardization Version 1.3 with regard to + interface definitions. The architectural descriptions focus on + intent-based voice-based personal assistants and chatbots. + Current LLM intent-less chatbots may have other interface needs. +

    Status of This Document

    -


    +

    + This specification was published by the Voice + Interaction Community Group. It is not a W3C Standard + nor is it on the W3C Standards Track. Please note that under + the W3C + Community Contributor License Agreement (CLA) there is a + limited opt-out and other conditions apply. Learn more about + W3C Community and + Business Groups. + +

    Table of Contents

    @@ -57,66 +83,73 @@


  • Introduction
  • Problem Statement
  • Architecture
  • -
  • High Level Interfaces
  • +
  • - +

    1. Introduction

    -


    +

    Intelligent Personal Assistants (IPAs) are now available in + our daily lives through our smart phones. Apple’s Siri, Google + Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more + are helping us with various tasks, like shopping, playing music, + setting a schedule, sending messages, and offering answers to + simple questions. Additionally, we equip our households with + smart speakers like Amazon’s Alexa or Google Home which are + available without the need to pick up explicit devices for these + sorts of tasks or even control household appliances in our + homes. As of today, there is no interoperability among the + available IPA providers. Especially for exchanging learned user + behaviors this is unlikely to happen at all.

    Furthermore, in addition to these general-purpose assistants, there are also specialized virtual assistants which are able to - provide their users with in-depth information which is specific to an - enterprise, government agency, school, or other organization. They may - also have the ability to perform transactions on behalf of their - users, such as purchasing items, paying bills, or making reservations. - Because of the breadth of possibilities for these specialized - assistants, it is imperative that they be able to interoperate with - the general-purpose assistants. Without this kind of interoperability, - enterprise developers will need to re-implement their intelligent + provide their users with in-depth information which is specific + to an enterprise, government agency, school, or other + organization. They may also have the ability to perform + transactions on behalf of their users, such as purchasing items, + paying bills, or making reservations. Because of the breadth of + possibilities for these specialized assistants, it is imperative + that they be able to interoperate with the general-purpose + assistants. Without this kind of interoperability, enterprise + developers will need to re-implement their intelligent assistants for each major generic platform.

    -

    This document is the second step in our strategy for IPA +

    + This document is the second step in our strategy for IPA standardization. It is based on a general architecture of IPAs described in Architecture and Potential for Standardization Version 1.3 - which aims at exploring - the potential areas for standardization. It focuses on voice as the - major input modality. We believe it will be of value not only to - developers, but to many of the constituencies within the intelligent - personal assistant ecosystem. Enterprise decision-makers, strategists - and consultants, and entrepreneurs may study this work to learn of - best practices and seek adjacencies for creation or investment. The - overall concept is not restricted to voice but also covers purely text - based interactions with so-called chatbots as well as interaction + href="https://w3c.github.io/voiceinteraction/voice%20interaction%20drafts/paArchitecture/paArchitecture-1-3.htm">Architecture + and Potential for Standardization Version 1.3 which aims at + exploring the potential areas for standardization. It focuses on + voice as the major input modality. We believe it will be of + value not only to developers, but to many of the constituencies + within the intelligent personal assistant ecosystem. Enterprise + decision-makers, strategists and consultants, and entrepreneurs + may study this work to learn of best practices and seek + adjacencies for creation or investment. The overall concept is + not restricted to voice but also covers purely text based + interactions with so-called chatbots as well as interaction using multiple modalities. Conceptually, the authors also define executing actions in the user's environment, like turning on the - light, as a modality. This means that components that deal with speech - recognition, natural language understanding or speech synthesis will - not necessarily be available in these deployments. In case of - chatbots, speech components will be omitted. In case of multimodal - interaction, interaction modalities may be extended by components to - recognize input from the respective modality, transform it into - something meaningful and vice-versa to generate output in one or more - modalities. Some modalities may be used as output-only, like turning - on the light, while other modalities may be used as input-only, like + light, as a modality. This means that components that deal with + speech recognition, natural language understanding or speech + synthesis will not necessarily be available in these + deployments. In case of chatbots, speech components will be + omitted. In case of multimodal interaction, interaction + modalities may be extended by components to recognize input from + the respective modality, transform it into something meaningful + and vice-versa to generate output in one or more modalities. + Some modalities may be used as output-only, like turning on the + light, while other modalities may be used as input-only, like touch.

    -

    In this second step we describe the interfaces of the general +

    + In this second step we describe the interfaces of the general architecture of IPAs in Architecture and Potential for Standardization Version 1.3. We believe it @@ -127,247 +160,386 @@

    best practices and seek adjacencies for creation or investment.

    -


    +    

    + Interfaces are described with the help of UML diagrams. We + expect the reader to be familiar with that notation, although + most concepts are easy to understand and do not require in-depth + knowledge. The main diagram types used in this document are component + diagrams and sequence + diagrams. The UML diagrams are provided as Enterprise + Architect Model pa-architecture.EAP. + They can be viewed with the free of charge tool EA + Lite +

    + +

    + 2. Problem Statement +

    + +

    + 2.1 Use Cases +

    +

    This section describes potential usages of IPAs that will be + used later in the document to illustrate the usage of the + specified interfaces.

    + +

    + 2.1.1 Weather Information +

    + +

    A user located in Berlin, Germany, is planning to visit her + friend a few kilometers away, the next day. As she considers + taking the bike, she asks the IPA for weather conditions.

    + +

    + 2.1.2 Flight Reservation +

    + +

A user located in Berlin, Germany, would like to plan a trip to an international conference and she wants to book a flight to the conference in San Francisco. Therefore, she approaches the IPA to help her with booking the flight.

    + + +

    + 3. Architecture +

    + +

    + 3.1 Architectural Principle +

    + +

    + The architecture described in this document follows the SOLID + principle introduced by Robert C. Martin to arrive at a + scalable, understandable and reusable software solution. +

    +
    +
    Single responsibility principle
    +
    The components should have only one clearly-defined + responsibility.
    +
    Open closed principle
    +
    Components should be open for extension, but closed for + modification.
    +
    Liskov substitution principle
    +
    Components may be replaced without impacts onto the + basic system behavior.
    +
    Interface segregation principle
    +
    Many specific interfaces are better than one + general-purpose interface.
    +
    Dependency inversion principle
    +
    High-level components should not depend on low-level + components. Both should depend on their interfaces.
    +
    + +

This architecture aims at supporting both a traditional partitioning of conversational systems, with separate components for speech recognition, natural language understanding, dialog management, natural language generation, and audio output (audio files or text to speech), and newer LLM (Large Language Model) based approaches. This architecture does not rule out combining some of these components in specific systems.

    + +

    + 3.2 Main Use Cases +

    + +

    Among others, the following most popular high-level use cases + for IPAs are to be supported

    +
      +
    1. Question Answering or Information Retrieval
    2. +
    3. Executing local and/or remote services to accomplish + tasks
    4. +
    +

This is supported by a flexible architecture that supports dynamically adding local and remote services or knowledge sources such as data providers. Moreover, it is possible to include other IPAs, with the same architecture, and forward requests to them, similar to the principle of a Russian doll (omitting the Client Layer). All this describes the capabilities of the IPA. These extensions may be selected from a standardized marketplace. For the remainder of this document, we consider an IPA that is extensible via such a marketplace.

    + +

    The following table lists the IPA main use cases and related + examples that are used in this document

    + + + + + + + + + + + + + +
Main Use Case | Example
Question Answering or Information Retrieval | Weather information
Executing local and/or remote services to accomplish tasks | Flight reservation
    +

    These main use cases are shown in the following figure

    + Main IPA Use Cases + +

Not all components may be needed for actual implementations; some may be omitted completely. In particular, LLM-based architectures may combine the functionality of multiple components into only one or a few components. However, we note them here to provide a more complete picture.

    +

    The architecture comprises three layers that are detailed in + the following sections

    +
      +
    1. Client Layer
    2. +
    3. Dialog Layer
    4. +
    5. External Data / Services / IPA + Providers
    6. +
    +

Actual implementations may want to distinguish more or fewer than these layers. The assignment to the layers is not considered to be strict, so some of the components may be shifted to other layers as needed. This view only reflects what the Community Group regards as ideal and shows the intended separation of concerns.

    + IPA Major Components + +

These components are assigned to the packages shown below.

    + IPA Package Hierarchy + +

    + 4. High Level Interfaces +

    + +

    + This section details the interfaces from the figure shown in the + architecture. The interfaces are + described with the following attributes +

    +
    +
    name
    +
    Name of the attribute
    +
    type
    +
    Hint if this attribute is a single data item or a + category. The exact data types of the attributes are left + open for now. A category may contain other categories or data + items.
    +
    description
    +
    A short description to illustrate the purpose of this + attribute.
    +
    required
    +
    Flag, if this attribute is required to be used in this + interface.
    +
    + +

    A typical flow for the high level interfaces is shown in the + following figure.

    + IPA Major Components Interaction +

    This sequence supports the major use cases stated + above.

    + +

    + 4.1 Interface Client Input +

    +

    + This interface describes the data that is sent from the IPA Client to the IPA Service. The following table + details the data that should be considered for this interface in + the method processInput +

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
name | type | description | required
session id | data item | unique identifier of the session | yes, if obtained
request id | data item | unique identifier of the request within a session | yes
audio data | data item | encoded or raw audio data | yes
multimodal input | category | input that has been received from modality recognizers, e.g., text, gestures, pen input, ... | no
meta data | category | data augmenting the request, e.g., user identification, timestamp, location, ... | no
    + +

    + The session id can be created by the IPA Service. In case a session id is + provided, it must be used for subsequent calls. +

    + +

    + The IPA Client maintains request + id for each request that is being sent via this interface. + These ids must be unique within a session. +
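As a hedged illustration of these two rules (the values and the audio placeholder are made up, and the wrapper keys only label the three messages of the sketch): the first request carries no session id, the IPA Service assigns one in its response, and the client reuses it together with a new request id on the next turn.

    {
      "firstRequest":  { "requestId": "1", "audio": "..." },
      "firstResponse": { "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
                         "requestId": "1", "audio": "..." },
      "nextRequest":   { "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
                         "requestId": "2", "audio": "..." }
    }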

    + +

    + Audio data can be delivered mainly in two ways +

    +
      +
    1. Endpointed audio data
    2. +
    3. Streamed audio data
    4. +
    + +

For endpointed audio data the IPA Client determines the end of speech, e.g., with the help of voice activity detection. In this case only that portion of audio is sent that contains the potential spoken user input. In terms of user experience this means that processing of the user input can only happen after the end of speech has been detected.

    + +

    + For streamed audio data, the IPA Client + starts sending audio data as soon as it has been detected that + the user is speaking to the system with the help of the Client Activation + Strategy. In terms of user experience this means that + processing of the user input can happen while the user is + speaking. +

    + +

    An audio codec may be used, e.g., to reduce the amount of + data to be transferred. The selection of the codec is not part + of this specification.
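As a purely illustrative sketch of the two delivery options (the field names, codec, and URI are assumptions of this sketch and not part of this specification), the audio data item could be carried either inline for endpointed audio or as a reference for streamed audio:

    {
      "endpointed": { "encoding": "audio/opus", "data": "BASE64-ENCODED-AUDIO" },
      "streamed":   { "encoding": "audio/opus", "streamUri": "wss://example.org/ipa/session/42/audio" }
    }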

    + + Optionally, + multimodal input can be transferred that has + been captured as input from a specific modality recognizer. + Modalities are all other modalities but audio, e.g., text for a + chat bot, or gestures. +

    + +

    + Optionally, meta data may be transferred augmenting the + input. Examples of such data include user identification, + timestamp and location. +

    + +

    + The IPA Service may maintain a session + id, e.g., to serve multiple clients and allow them to be + distinguished. +

    + +

    + As a return value this interface describes the data that is sent + from the IPA Service to the IPA Client. The following table + details the data that should be considered for this interface in + the ClientResponse. +

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
name | type | description | required
session id | data item | unique identifier of the session | yes, if obtained
request id | data item | unique identifier of the request within a session | yes
audio data | data item | encoded or raw audio data | yes
multimodal output | category | output that has been received from modality synthesizers, e.g., text, command to execute an observable action, ... | no
    + +

    + In case the parameter multimodal output contains commands + to be executed, they are expected to follow the specification of + the Interface Service Call. +

    + +

    The following sections will provide examples using the JSON + format to illustrate the interfaces. JSON is only chosen as it + is easy to understand and read. This specification does not make + any assumptions about the underlying programming languages or + data format. They are just meant to be an illustration of how + responses may be generated with the provided data. It is not + required that implementations follow exactly the described + behavior.

    + +

    + 4.1.2 Example + Weather Information for Interface Client Input +

    + +

The following request to processInput sends endpointed audio data with the user's current location to query for tomorrow's weather with the utterance "What will the weather be like tomorrow".

    +
     {
     	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
     	"requestId": "42",
    @@ -388,16 +560,18 @@ 

    4.1.2 ... } }

    -

    - -

    In this example endpointed audio data is transfered as a value. There are other ways to - send the audio data to the IPA, e.g., as a reference. This way is chosen as it is easier to - illustrate the usage.

    - -

    - In return the the IPA may send back the following response Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees - via deliverResponse to the Client. -

    +
    +    

In this example endpointed audio data is transferred as a value. There are other ways to send the audio data to the IPA, e.g., as a reference. This way is chosen as it is easier to illustrate the usage.

    + +

In return, the IPA may send back the following response Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees via ClientResponse to the Client.

    +
     {
     	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
     	"requestId": "42",
    @@ -414,14 +588,17 @@ 

    4.1.2 ... } }

    -

    - -

    4.1.3 Example Flight Reservation for Interface Client Input

    - -

    - The following request to processInput sends endpointed audio data with the user's current location to book a flight with the utterance - I want to fly to San Francisco. -

    +
    +    

    + 4.1.3 Example + Flight Reservation for Interface Client Input +

    + +

    + The following request to processInput sends endpointed + audio data with the user's current location to book a flight + with the utterance I want to fly to San Francisco. +

     {
     	"sessionId": "0c27895c-644d-11ed-81ce-0242ac120002",
     	"requestId": "15",
    @@ -442,12 +619,13 @@ 

    4.1.3 < ... } }

    -

    - -

    - In return the the IPA may send back the following response When do you want to fly from Berlin to San Francisco? - via deliverResponse to the Client -

    +    

    + +

In return, the IPA may send back the following response When do you want to fly from Berlin to San Francisco? via ClientResponse to the Client.

     {
     	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
     	"requestId": "42",
    @@ -466,8 +644,8 @@ 

    4.1.3 < }

    - 4.2 External - Client Input + 4.2 External Client Input

This interface describes the data that is sent from the Selection Service to the NLU and Dialog Management. The following table details the data that should be considered for this interface in the ExternalClientResponse.

    @@ -519,19 +697,21 @@

    - + - + + href="#errorhandling">Error Handling
multimodal output | category | output that has been received from an external IPA | yes, if no interpretation is provided and no error occurred
interpretation | category | meaning as intents and associated entities | yes, if no multimodal output is provided and no error occurred
error | category | error as detailed in section Error Handling | yes, if an error during execution is observed
    @@ -554,8 +734,9 @@

The category interpretation may be one of the following options, depending on the capabilities of the external IPA:
• single-intent, i.e. provide a single intent in a single utterance
• multi-intent, i.e. provide multiple intents in a single utterance

utterance. An example for single-intent is "Book a flight to San Francisco for tomorrow morning." The single intent is here book-flight. With multi-intent the user provides multiple intents in a single utterance. An example for multi-intent is "How is the weather in San Francisco and book a flight for tomorrow morning." Provided intents are check-weather and book-flight. In this case the IPA needs to determine the order of intent execution based on the structure of the utterance. If the intents are not to be executed in parallel, the IPA will trigger the next intent in the identified order.
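A minimal sketch of how the interpretation category could carry the multi-intent example above. Representing the interpretation as an array of intents, as well as the field names themselves, are illustrative assumptions and not definitions made by this document.

{
	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
	"requestId": "42",
	"interpretation": [
		{
			"intent": "check-weather",
			"entities": [
				{ "type": "location", "value": "San Francisco" }
			]
		},
		{
			"intent": "book-flight",
			"entities": [
				{ "type": "location", "value": "San Francisco" },
				{ "type": "date", "value": "tomorrow morning" }
			]
		}
	]
}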

      @@ -633,15 +814,23 @@


4.2.1 Example Weather Information for Interface External Client Input

The following request to processInput is a copy of Example Weather Information for Interface Client Input.

In return the external IPA may send back the following response via ExternalClientResponse to the Dialog.

      +
       {
           "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
           "requestId": "42",
      @@ -663,29 +852,43 @@ 

      4.2.1 -


The external speech recognizer converts the obtained audio into text like "How will be the weather tomorrow". The NLU then extracts the following from that decoded utterance, other multimodal input and metadata.

• intent: check-weather from, e.g., utterance part How will the weather…
• entity: date from utterance part …tomorrow…
• entity: location, e.g., from the multimodal input of location

      This is illustrated in the following figure.

Processing Input of the check weather example
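The extracted intent and entities above can be restated in the JSON style of the earlier examples, purely as an illustration; the field names are assumptions and not normative, and the location value Berlin reflects the user's current location from the request metadata.

{
	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
	"requestId": "42",
	"interpretation": {
		"intent": "check-weather",
		"entities": [
			{ "type": "date", "value": "tomorrow" },
			{ "type": "location", "value": "Berlin" }
		]
	}
}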

4.2.2 Example Flight Reservation for Interface External Client Input

The following request to processInput is a copy of Example Flight Reservation for Interface Client Input.

In return the IPA may send back the following response "When do you want to fly from Berlin to San Francisco?" via ClientResponse to the Client. In this case, empty entities, like date, indicate that there are still slots to be filled and no service call can be made right now.

       {
           "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
           "requestId": "42",
      @@ -712,19 +915,32 @@ 

      4.2.2 < ] }


The external speech recognizer converts the obtained audio into text like "I want to fly to San Francisco". The NLU then extracts the following from that decoded utterance, other multimodal input and metadata.

• intent: book-flight from, e.g., utterance part I want to fly…
• entity: location from utterance part …San Francisco…
• entity: location, e.g., from the multimodal input of location
This is illustrated in the following figure.

Processing Input of the flight reservation example

Further steps will be needed to convert both location entities to origin and destination in the actual reply. This may be done either by the flight reservation IPA directly or by calling external services beforehand to determine the nearest airports from these locations.
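A minimal sketch of what the interpretation could look like after such a conversion. The field names and the use of IATA airport codes (BER for Berlin, SFO for San Francisco) are illustrative assumptions; the empty date entity reflects the still unfilled slot mentioned above.

{
	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
	"requestId": "42",
	"interpretation": {
		"intent": "book-flight",
		"entities": [
			{ "type": "origin", "value": "BER" },
			{ "type": "destination", "value": "SFO" },
			{ "type": "date", "value": "" }
		]
	}
}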

      4.3 External @@ -773,8 +989,8 @@

As a return value, the result of this call is sent back in the ClientResponse.

      @@ -818,20 +1034,29 @@

error | category | error as detailed in section Error Handling | yes, if an error during execution is observed
      -

      This call is optional depending on the result of the next dialog step if an external service should be called or not.

This call is optional; whether it is made depends on the result of the next dialog step, i.e., on whether an external service should be called or not.

4.3.1 Example Weather Information for Interface Service Call


The following request to callService may be made to call the weather information service. Although calling the weather service is not a direct functionality of the IPA, it may help to understand how the entered data may be processed to obtain a spoken reply to the user's input.

      + +
       {
           "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
           "requestId": "42",
      @@ -848,11 +1073,13 @@ 

      4.3.1 -


In return the external service may send back the following response via ExternalClientResponse to the Dialog.

      + +
       {
           "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
           "requestId": "42",
      @@ -874,11 +1101,13 @@ 

      4.3.1 -


This information is then used to actually create a reply to the user as described in ClientResponse to the Client.
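A minimal sketch of how the returned weather data could be turned into the spoken reply from the weather example; the text field name is an illustrative assumption.

{
	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
	"requestId": "42",
	"text": "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees"
}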


      +

4.4 Error Handling

      Errors may occur anywhere in the processing chain of the IPA. The following gives an overview of how they are suggested to be handled.
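Purely as an illustration of how such an error could be reported through the error category used in the interface tables above, assuming hypothetical field names (source, code, message) that are not defined by this document:

{
	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
	"requestId": "42",
	"error": {
		"source": "ASR",
		"code": "no-input",
		"message": "no speech was detected in the provided audio"
	}
}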

      @@ -921,75 +1150,131 @@

      Error Handling


5. Low Level Interfaces

This section is still under preparation.

5.1 Client Layer

The Client Layer contains the main components that interface with the user.

Client Component

5.1.1 IPA Client

Clients enable the user to access the IPA via voice. The following diagram provides some more insight.

IPA Client

5.1.1.1 Modality Manager

The modality manager enables access to the modalities that are supported by the IPA Client. Major modalities are voice and, in the case of chatbots, text. The following interfaces are supported:

• Client Interaction
• Handle-xxx-Modality

5.1.1.2 Client Activation Strategy

The Client Activation Strategy defines how the client gets activated to be ready to receive spoken commands as input. Upon activation, the Microphone is opened for recording. Client Activation Strategies are not exclusive but may be used concurrently. The most common activation strategies are described in the table below.

Client Activation Strategy | Description
Push-to-talk | The user explicitly triggers the start of the client by means of a physical or on-screen button or its equivalent in a client application.
Hotword | In this case, the user utters a predefined word or phrase to activate the client by voice. Hotwords may also be used to preselect a known IPA Provider. In this case the identifier of that IPA Provider is also used as additional metadata augmenting the input (a metadata sketch is given below). This hotword is usually not part of the spoken command that is passed for further evaluation.
Local Data Providers | In this case, a change in the environment may activate the client, for example if the user enters a room.
... | ...

The usage of hotwords raises privacy concerns, as the microphone needs to be always active. Streaming to components outside the user's control should be avoided; hence, detection of hotwords should ideally happen locally. With regard to nested usage of IPAs that may feature their own hotwords, the detection of hotwords might be required to be extensible.
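A minimal sketch of how a locally detected hotword could augment the request metadata with the preselected IPA Provider. The hotword value and the field names metaData, hotword, and ipaProviderId are illustrative assumptions and not defined by this document.

{
	"sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
	"requestId": "43",
	"metaData": {
		"hotword": "hey weather",
		"ipaProviderId": "example-weather-ipa"
	}
}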

5.2 Dialog Layer

The Dialog Layer contains the main components to drive the interaction with the user.

Dialog Component

5.2.1 IPA Service

5.2.2 ASR

5.2.3 NLU

5.2.4 Dialog Management

5.3 External Data / Services / IPA Providers

External Data / Services / IPA Providers Component

5.3.1 Provider Selection Service

      + + + + \ No newline at end of file