Intelligent Personal Assistant
Introduction
An intelligent personal assistant is software that helps people with basic tasks, usually through natural language. It can perform several kinds of tasks (internal commands, web access, Q/A, etc.), and either text or voice can trigger an action.
Voice Assistant
The key here is voice. A voice assistant is an intelligent personal assistant that uses voice recognition, speech synthesis, and natural language processing (NLP) to provide a service through a particular application.
Use of Voice assistants
Stand alone (internal commands)
- open facebook
- set the timer for 2 minutes
Applications developed for the platform that can take voice commands (skills)
Web searches
- search for a thai restaurant in Ottawa
Factual Q/A
- Who is the prime minister of Canada?
- What is the Coen brothers' latest movie?
Chatbot / Conversational Agent
Text is the main way to get assistance from a chatbot. Chatbots can simulate a conversation with a human user. Many companies use them in the customer service sector to answer basic questions and connect with a live person if necessary.
Intelligent Personal Assistant
- Simple dialog (linear/slot filling)
- Adaptability to the owner
Chatbot
- Often associated with a company (bank, reseller).
- The chatbot replaces a customer service or technical support person.
- More complex dialog with multiple speech turns.
Architecture and Wake Word Detection
Basic Architecture
- Wake word Detection
- Speech to text conversion, Automatic Speech Recognition (ASR)
- Intent Detection, Natural Language Understanding (NLU)
- Action
- Answer Generation, Natural Language Generation (NLG)
- Speech synthesis, Text To Speech (TTS)
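This flow can be sketched as a chain of functions. The following is a minimal, hypothetical Python sketch; every function name here is an illustrative placeholder for a real component (wake word model, ASR engine, NLU model, fulfillment service, TTS engine), not an actual library API.

```python
# Hypothetical end-to-end assistant pipeline; each stage is a stub standing
# in for a real component.

def detect_wake_word(audio_frame) -> bool:
    """Wake word detection: is the assistant being addressed?"""
    ...

def transcribe(audio) -> str:
    """ASR: speech-to-text conversion."""
    ...

def understand(text: str) -> dict:
    """NLU: return an intent label plus extracted entities (slots)."""
    ...

def execute(intent: dict) -> dict:
    """Action: internal command, external supplier, knowledge base or search."""
    ...

def generate_answer(result: dict) -> str:
    """NLG: turn the action result into a natural-language answer."""
    ...

def speak(text: str) -> None:
    """TTS: speech synthesis of the answer."""
    ...

def handle_turn(audio_frame, audio) -> None:
    if detect_wake_word(audio_frame):      # 1. wake word detection
        text = transcribe(audio)           # 2. ASR
        intent = understand(text)          # 3. intent detection (NLU)
        result = execute(intent)           # 4. action
        answer = generate_answer(result)   # 5. answer generation (NLG)
        speak(answer)                      # 6. speech synthesis (TTS)
```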
Wake word detection
Automatic Speech Recognition
Machine Learning is used to learn words from speech signals.
Frequency analysis is performed on each little slice of audio to learn the corresponding letters.
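As a rough illustration of the frequency-analysis step, here is a minimal sketch that slices an audio signal into short frames and computes the magnitude spectrum of each frame with NumPy; a real ASR front end would go further (mel filter banks, MFCCs, a neural acoustic model).

```python
import numpy as np

def frame_spectra(signal: np.ndarray, sample_rate: int,
                  frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Cut the signal into short overlapping frames and return the
    magnitude spectrum of each frame (a simple spectrogram)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)

# Example: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = frame_spectra(np.sin(2 * np.pi * 440 * t), sr)
print(spec.shape)  # (number of frames, frequency bins per frame)
```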
Characterizing elements
- Speaker
  - Speaker's characteristic
    - Age, gender, anatomy (vocal cords)
  - Current state
    - Level of stress, emotional state
  - Language and culture
    - American English, British English, Australian English
- Utterance
  - Utterance method
    - Isolated words vs Connected words
  - Utterance context
    - Continuous Speech vs Spontaneous Speech
  - Production characteristics
    - Whisper vs normal voice vs scream
- Discourse Content
  - Domain
    - The smaller the domain, the better the system performance (less ambiguity).
  - Vocabulary
    - Small vocabulary - tens of words
    - Medium vocabulary - hundreds of words
    - Large vocabulary - thousands of words
    - Very-large vocabulary - tens of thousands of words
- Environment
  - Surrounding sounds
    - A clock ticking, a computer humming, a radio playing somewhere down the corridor, another human speaker in the background, etc.
  - Transducer
    - Phone, Headset, Smart Speaker
  - Channel variability
    - Signal distortion (perhaps from the transducer)
    - Echo (perhaps coming from the room)
Intent Detection
Intent detection/classification is a problem of understanding the language (NLU).
NLU is very complex because the language is full of ambiguity (vague words, synonyms, paraphrases).
A "restaurant_search" intention can be expressed in different ways:
- I'm hungry!
- Show me some good pizza places
- I want to eat sushi with my friend
- Are there any good Italian restaurants near here?
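A minimal sketch of intent classification, assuming a small set of labelled example utterances and a bag-of-words classifier from scikit-learn; real NLU components typically rely on much larger training sets and pretrained language models.

```python
# Minimal intent classifier sketch (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: utterance -> intent label.
examples = [
    ("I'm hungry!", "restaurant_search"),
    ("Show me some good pizza places", "restaurant_search"),
    ("I want to eat sushi with my friend", "restaurant_search"),
    ("Are there any good Italian restaurants near here?", "restaurant_search"),
    ("What is the weather tomorrow?", "weather"),
    ("Will it rain this afternoon?", "weather"),
    ("Set the timer for 2 minutes", "set_timer"),
    ("Start a ten minute timer", "set_timer"),
]
texts, labels = zip(*examples)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

print(model.predict(["any sushi place around?"])[0])  # likely 'restaurant_search'
```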
Linear dialogs
The process of collecting all the necessary information to complete an action (e.g. booking an appointment or placing an order).
Non-linear dialogs
Closer to a "real" conversation with branches, twists and turns, based on changes and context.
- Very complex to build
- Quickly become unmanageable after 4-5 speech turns
- Keeping track of the "state" is hard (see the sketch after this list)
- Which slots are filled
- Also, which values were tried for a slot and failed
- Users should be reminded: "We tried 9pm before and it was unavailable."
- Careful of looping
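A minimal sketch of slot filling with state tracking, assuming a hypothetical table-booking intent and made-up slot names; it records both the filled slots and the values already tried and rejected, so the assistant can remind the user of earlier failures.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DialogState:
    """Tracks one linear slot-filling dialog."""
    required_slots: tuple
    filled: dict = field(default_factory=dict)   # slot -> accepted value
    failed: dict = field(default_factory=dict)   # slot -> values tried and rejected

    def fill(self, slot: str, value: str) -> None:
        self.filled[slot] = value

    def reject(self, slot: str, value: str) -> None:
        """Record a value that turned out to be unavailable and clear the slot."""
        self.failed.setdefault(slot, []).append(value)
        self.filled.pop(slot, None)

    def next_question(self) -> Optional[str]:
        """Ask for the first missing slot, reminding the user of failed tries."""
        for slot in self.required_slots:
            if slot not in self.filled:
                reminder = ""
                if self.failed.get(slot):
                    reminder = f" (we already tried {', '.join(self.failed[slot])})"
                return f"What {slot} would you like?{reminder}"
        return None  # all slots filled: ready for fulfillment

# Hypothetical restaurant-booking dialog.
state = DialogState(required_slots=("restaurant", "time", "party size"))
state.fill("restaurant", "Gezellig")
state.fill("time", "9pm")
state.reject("time", "9pm")      # 9pm turned out to be unavailable
print(state.next_question())     # What time would you like? (we already tried 9pm)
```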
Hybrid chatbot with menus
If the conversation must lead to an action (fulfillment), we want to avoid multiple turns with the service that will provide the action. An alternative is a hybrid conversational agent that sometimes gives menu options, as sketched below.
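A minimal sketch of that idea, assuming the fulfillment service has already returned a short list of alternatives; instead of more free-form speech turns, the agent presents them as a menu.

```python
def offer_menu(prompt: str, options: list) -> str:
    """Present a fixed menu instead of another open-ended speech turn."""
    print(prompt)
    for i, option in enumerate(options, start=1):
        print(f"  {i}. {option}")
    choice = int(input("> "))    # in a chat UI this would be a button click
    return options[choice - 1]

# Example: 9pm was unavailable, so the booking service returned alternatives.
# selected = offer_menu("That time is unavailable. Please pick another slot:",
#                       ["6:30pm", "7:00pm", "8:15pm"])
```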
Action
Executing the desired action can require:
- Access to internal commands
- Access to external suppliers
- Access to knowledge bases
- Access to search engines
Access to internal commands:
- Access to “standard” or added applications
- open facebook
- Skills (applications) developed by different people
- Internal commands
- set the timer for 2 minutes
- Other behavior
- Tell me a joke
Access to external suppliers:
- Simple tasks requiring a service provider (3rd party)
- What is the weather tomorrow?
- Read the latest sport news on CBC
- Complex tasks requiring exchange of information with a service provider (3rd party)
- I want to reserve a table for 2 at Gezellig
Access to knowledge bases:
- Factual Q/A
- Who is the prime minister of Canada?
- What is the Coen brothers' latest movie?
Access to search engines:
- Web searches
- search for a thai restaurant in Ottawa
- Factual Q/A (not answered by KB)
- Is there a vaccine for COVID-19?
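A minimal sketch of routing a detected intent to one of these four kinds of access; the handler names, intent labels, and routing table are illustrative, not a real assistant API.

```python
# Hypothetical routing of intents to the kinds of access listed above.

def run_internal_command(intent):    # e.g. "set the timer for 2 minutes"
    ...

def call_external_supplier(intent):  # e.g. weather provider, restaurant booking
    ...

def query_knowledge_base(intent):    # e.g. "Who is the prime minister of Canada?"
    ...

def query_search_engine(intent):     # fallback for open-ended web searches
    ...

HANDLERS = {
    "set_timer": run_internal_command,
    "weather": call_external_supplier,
    "book_table": call_external_supplier,
    "factual_qa": query_knowledge_base,
    "web_search": query_search_engine,
}

def execute(intent: dict):
    handler = HANDLERS.get(intent["name"], query_search_engine)
    return handler(intent)
```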
Answer Generation
Generating a response
Types of answers:
- Complete answers pre-recorded (pre-coded).
- Answer templates with variables.
Complete pre-recorded answers
- Lists of stories or jokes.
- Tell me a joke.
- "Two guys who are going ....“
- Answers for smalltalk
- "Can I help you?".
- How are you?
- "I'm fine and you.“
- Answer of excuse
- "I do not understand your question.“
- Answer of accomplishment.
- “OK. Done”
Answer templates
- Replacing variables with the entities extracted during intent detection
- Play music by Brad Mehldau.
- Intent "play_music"
- Artist: Brad Mehldau
- Template: "Here's X's music."
- Instantiated template: "Here's Brad Mehldau's music."
Example based on patterns for the weather
(A="It is", D="Tomorrow will be", M="Wednesday will be") a (sunny, cloudy, rainy) day and X degrees celsius.
Speech Synthesis
Speech synthesis is a form of output where a computer or other machine reads words to you out loud in a real or simulated voice played through a loudspeaker; the technology is often called text-to-speech (TTS).
Steps for Speech Synthesis
Pre-processing for disambiguation
Preprocessing involves going through the text and cleaning it up so the computer makes fewer mistakes when it actually reads the words aloud.
Things like numbers, dates, times, abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned into words—and that's harder than it sounds.
Example: The number 1843 might refer to
- a quantity of items ("one thousand eight hundred and forty three")
- a year ("eighteen forty three")
- or a padlock combination ("one eight four three")
each of which is read out slightly differently.
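A minimal sketch of this disambiguation for the 1843 example; the context label is assumed to come from earlier text analysis, and the quantity reading uses a tiny hand-rolled converter that only covers four-digit numbers.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n: int) -> str:
    """Read a number below 100 as words."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

def read_number(n: int, context: str) -> str:
    """Expand a number like 1843 according to its (assumed) context."""
    if context == "digits":          # padlock combination: one digit at a time
        return " ".join(ONES[int(d)] for d in str(n))
    if context == "year":            # 1843 -> "eighteen forty three"
        return two_digits(n // 100) + " " + two_digits(n % 100)
    # quantity: only handles 1000..9999 for this illustration
    return (ONES[n // 1000] + " thousand "
            + ONES[(n // 100) % 10] + " hundred and "
            + two_digits(n % 100))

for ctx in ("quantity", "year", "digits"):
    print(ctx, "->", read_number(1843, ctx))
```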
Words to phonemes
Breaking words down into sounds
Phonemes to sounds
Concatenative Approach
Speech synthesizers that use recorded human voices have to be preloaded with little snippets of human sound they can rearrange.
In other words, a programmer has to record lots of examples of a person saying different things, break the spoken sentences into words and the words into phonemes.
If there are enough speech samples, the computer can rearrange the bits in any number of different ways to create entirely new words and sentences.
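As a rough illustration, a concatenative synthesizer can be sketched as joining prerecorded phoneme waveforms; the phoneme inventory and "recordings" below are placeholders (short sine waves), not real speech data.

```python
import numpy as np

sr = 16000  # sample rate in Hz

# Placeholder "recordings": in a real system each phoneme maps to a snippet
# of recorded human speech; here we fake them with short sine waves.
def fake_snippet(freq: float, ms: int = 120) -> np.ndarray:
    t = np.arange(int(sr * ms / 1000)) / sr
    return np.sin(2 * np.pi * freq * t)

SNIPPETS = {"HH": fake_snippet(200), "EH": fake_snippet(300),
            "L": fake_snippet(250), "OW": fake_snippet(350)}

def synthesize(phonemes: list) -> np.ndarray:
    """Concatenate the stored snippets to build a new utterance."""
    return np.concatenate([SNIPPETS[p] for p in phonemes])

audio = synthesize(["HH", "EH", "L", "OW"])  # "hello"
print(audio.shape)
```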
Modulation for intonation, volume
As any good actor can demonstrate, a sentence can be read in different ways depending on the meaning of the text, the person speaking, and the emotions they wish to express.
In linguistics, this idea is known as prosody and is one of the most difficult problems for voice synthesizers.