Virtual Assistants

Intelligent Personal Assistants vs. Chatbots

An intelligent personal assistant is software that assists people with basic tasks, usually through natural language. It can carry out tasks such as running internal commands or accessing the web. Either text or voice can trigger an action.

A voice assistant is an intelligent personal assistant that uses voice recognition, speech synthesis, and natural language processing to provide a service through a particular application.

A chatbot, also known as a conversational agent, provides assistance mainly through conversation. Chatbots can simulate a conversation with a human user. Many companies use them in customer service to answer basic questions and to connect the user with a live person if necessary.

Intelligent Personal Assistant:

Simple dialog (linear/slot filling)
Adaptable to owner

Chatbot:

Often associated with a company
usually replaces a customer service support person
more complex dialog with multiple speech turns

Wake word detection

The wake word is the phrase that activates the assistant. For example, the wake word for Google Assistant is "Hey Google".
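The gating role of the wake word can be sketched as follows. This is a toy illustration only: real detectors work directly on audio, whereas this sketch assumes the speech has already been transcribed to text. The function name `extract_command` is hypothetical.

```python
WAKE_WORD = "hey google"  # wake word from the example above


def extract_command(transcript, wake_word=WAKE_WORD):
    """Return the text after the wake word, or None if the wake word is absent."""
    # Lowercase, drop commas, and collapse whitespace so matching is lenient.
    normalized = " ".join(transcript.lower().replace(",", " ").split())
    if normalized == wake_word:
        return ""  # the assistant was woken, but no command followed
    if normalized.startswith(wake_word + " "):
        return normalized[len(wake_word):].strip()
    return None  # no wake word: the assistant stays idle


print(extract_command("Hey Google, set a timer"))  # set a timer
print(extract_command("Set a timer"))              # None
```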

Automatic Speech Recognition

Current ASR approaches use machine learning to learn words from speech signals. Frequency analysis is performed on each short slice of audio, and the model learns which letters correspond to that slice.
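The frequency-analysis step can be sketched with the standard library alone. The example below covers only the slicing and the per-slice analysis, not the learned model that maps slices to letters. The sample rate, the frame size, and the synthetic two-tone signal are assumptions chosen for illustration, and a plain discrete Fourier transform stands in for whatever analysis a real system uses.

```python
import cmath
import math


def frames(samples, frame_size):
    """Split samples into consecutive, non-overlapping frames of frame_size."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def dominant_frequency(frame, sample_rate):
    """Return the frequency (Hz) with the largest DFT magnitude in the frame."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip the DC bin, stop before Nyquist
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


SAMPLE_RATE = 8000  # assumed sample rate in Hz
FRAME_SIZE = 200    # 25 ms slices at 8 kHz

# Synthetic signal: 50 ms of a 400 Hz tone followed by 50 ms of an 800 Hz tone.
signal = [math.sin(2 * math.pi * 400 * t / SAMPLE_RATE) for t in range(400)]
signal += [math.sin(2 * math.pi * 800 * t / SAMPLE_RATE) for t in range(400)]

peaks = [dominant_frequency(f, SAMPLE_RATE) for f in frames(signal, FRAME_SIZE)]
print(peaks)  # [400.0, 400.0, 800.0, 800.0]
```

Each slice yields its own frequency description, which is the kind of feature a recognizer would be trained on.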

Speech recognition

Speaker

Speaker's characteristics
- Age, gender, anatomy (vocal cords)
Current state
- level of stress, emotional state
Language and culture

Utterance

Utterance method
- Isolated words vs connected words
Utterance context
- continuous speech vs spontaneous speech
Production characteristics
- whisper vs normal voice vs scream

Discourse Content

domain
- the smaller the domain, the better the system performance
- larger domain -> more homophones
vocabulary
- the larger the vocabulary, the more the system can understand

Environment

surrounding sounds
transducer
- phone, headset, smart speaker
channel variability
- signal distortion
- echo

Intent detection

Intent detection is a natural language understanding (NLU) problem.

NLU is very complex because language is full of ambiguity.

Types of dialogs

A linear dialog is a process of collecting all the information needed to complete an action. Intelligent personal assistants use this type of dialog.
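A linear (slot-filling) dialog can be sketched as a loop that asks for each missing piece of information in a fixed order. The slot names, prompts, and the "set timer" action below are hypothetical, chosen only to illustrate the idea.

```python
REQUIRED_SLOTS = ["duration", "label"]  # hypothetical slots for a "set timer" action

PROMPTS = {
    "duration": "How long should the timer run?",
    "label": "What should I call the timer?",
}


def next_prompt(slots):
    """Return the question for the first missing slot, or None when all are filled."""
    for name in REQUIRED_SLOTS:
        if name not in slots:
            return PROMPTS[name]
    return None


def run_dialog(answers):
    """Fill the slots in order from a list of user answers."""
    slots = {}
    for answer in answers:
        missing = [name for name in REQUIRED_SLOTS if name not in slots]
        if not missing:
            break  # every slot is filled; the action can be executed
        slots[missing[0]] = answer
    return slots


print(next_prompt({}))                       # How long should the timer run?
print(run_dialog(["10 minutes", "pasta"]))   # {'duration': '10 minutes', 'label': 'pasta'}
```

The dialog never branches: it always asks the same questions in the same order until nothing is missing.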

Non-linear dialogs are closer to "real" conversation with branches, twists and turns, based on changes to context. This is used by a conversational agent (chatbot). They are very complex to build, and can quickly become unmanageable.

Action

For a chatbot or an intelligent personal assistant to execute a desired action, it may require:

Access to internal commands
access to external suppliers
access to a knowledge base
access to search engines

Access to internal commands

access to "standard" or added applications
internal commands like "set timer"
other behavior
- "tell me a joke"
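Access to internal commands can be pictured as a lookup table from trigger phrases to handler functions. The handlers and the prefix-matching rule below are assumptions made for illustration; a real assistant would rely on intent detection rather than exact phrases.

```python
def set_timer(argument):
    """Hypothetical handler for the internal "set timer" command."""
    return f"Timer set for {argument}."


def tell_joke(argument):
    """Hypothetical handler for the "tell me a joke" behavior."""
    return "Here is a joke."  # placeholder response


COMMANDS = {
    "set timer": set_timer,
    "tell me a joke": tell_joke,
}


def execute(utterance):
    """Run the first internal command whose trigger phrase starts the utterance."""
    text = utterance.lower().strip()
    for phrase, handler in COMMANDS.items():
        if text.startswith(phrase):
            return handler(text[len(phrase):].strip())
    return None  # no internal command matched; another source would be tried


print(execute("Set timer 5 minutes"))   # Timer set for 5 minutes.
print(execute("What is the weather?"))  # None
```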

Access to external suppliers

simple tasks requiring a service provider
complex tasks requiring an exchange of information with a service provider

Access to knowledge base

Factual Q/A

Access to search engines

web searches

Answer Generation

Types of answers:

Answers that are pre-recorded
answer templates with variables
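The two answer types can be sketched with Python's `string.Template`. The answer keys and wording below are hypothetical; the point is only the difference between a fixed answer and a template whose variables are filled in at run time.

```python
from string import Template

# Fixed answers, returned as-is (stand-ins for pre-recorded responses).
PRERECORDED = {"greeting": "Hello, how can I help you?"}

# Answer templates with variables to substitute.
TEMPLATES = {"timer_set": Template("Your $label timer is set for $duration.")}


def generate_answer(kind, **variables):
    """Return a pre-recorded answer, or fill in the template of the given kind."""
    if kind in PRERECORDED:
        return PRERECORDED[kind]
    return TEMPLATES[kind].substitute(**variables)


print(generate_answer("greeting"))
print(generate_answer("timer_set", label="pasta", duration="10 minutes"))
# Your pasta timer is set for 10 minutes.
```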

Speech Synthesis

Speech synthesis is a form of output where a computer or other machine reads words to you out loud in a real or simulated voice played through a loudspeaker; the technology is often called text-to-speech (TTS).

Steps for speech synthesis include: preprocessing for disambiguation, words to phonemes, phonemes to sounds, and modulation for intonation and volume.

Preprocessing

Preprocessing involves going through the text and cleaning it up so the computer makes fewer mistakes when it actually reads the words aloud.

Things like numbers, dates, times, abbreviations, acronyms, and special characters need to be turned into words.
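A minimal sketch of this normalization step is shown below. The abbreviation table is a tiny illustrative sample, and only integers from 0 to 99 are spelled out. Real preprocessing also has to disambiguate by context (for example, "Dr." can mean "Doctor" or "Drive"), which this sketch does not attempt.

```python
import re

ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera"}  # illustrative sample only

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand abbreviations, the percent sign, and one- or two-digit numbers."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    text = text.replace("%", " percent")
    # Only standalone numbers of one or two digits are converted.
    return re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)


print(normalize("Dr. Lee has 42 cats"))  # Doctor Lee has forty-two cats
```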

Words to phonemes

Phonemes are the sound units that make up a word.

Five -> f-ay-v
One -> w-ah-n
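The simplest way to picture this step is a pronunciation dictionary lookup. The sketch below contains only the two words from the examples above; a real system would need a far larger dictionary plus rules for words that are missing from it.

```python
# Pronunciations taken from the two examples above.
PHONEMES = {
    "five": ["f", "ay", "v"],
    "one": ["w", "ah", "n"],
}


def to_phonemes(text):
    """Convert each word of the text into its phoneme sequence."""
    result = []
    for word in text.lower().split():
        if word not in PHONEMES:
            raise KeyError(f"no pronunciation known for {word!r}")
        result.extend(PHONEMES[word])
    return result


print(to_phonemes("Five one"))  # ['f', 'ay', 'v', 'w', 'ah', 'n']
```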

Phonemes to sounds

Speech synthesizers that use recorded human voices have to be preloaded with little snippets of human sound they can rearrange.

In other words, a programmer has to record lots of examples of a person saying different things, break the spoken sentences into words and the words into phonemes.

If there are enough speech samples, the computer can rearrange the bits in any number of different ways to create entirely new words and sentences.
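The rearranging idea can be sketched as plain concatenation of stored snippets. The sample values below are placeholders, not real audio, and the snippet table is hypothetical. Real synthesizers also smooth the joins between snippets, which is omitted here.

```python
# Placeholder "recordings": each phoneme maps to a short list of audio samples.
SNIPPETS = {
    "f": [0.1, 0.2],
    "ay": [0.3, 0.4, 0.5],
    "v": [0.6],
}


def synthesize(phonemes):
    """Concatenate the stored snippet of each phoneme into one sample stream."""
    samples = []
    for phoneme in phonemes:
        samples.extend(SNIPPETS[phoneme])
    return samples


print(synthesize(["f", "ay", "v"]))  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

The same snippets in a different order yield a different sound sequence, which is how new words are built from a fixed set of recordings.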

Modulation

As any good actor can demonstrate, a sentence can be read in different ways depending on the meaning of the text, the person speaking, and the emotions the speaker wishes to express.

In linguistics, this idea is known as prosody and is one of the most difficult problems for voice synthesizers.

Data Staging

Speech Recognition