1. User Engagement: AI Avatars

A user approaches the interactive super-terminal and is greeted by the VABot AI, represented by a "talking head" or Avatar on screen.

The user then initiates interaction either through the graphical user interface (GUI), by touching the screen, or through voice input via the microphone array built into the terminal.

2. Input Processing:

The voice input is captured by the microphone array and transmitted to the speech recognition module.

Simultaneously, the video camera may capture visual cues, and the touchscreen registers the user's selection or input.

3. Hardware and Software Integration:

The terminal's hardware components (microphone, camera, touchscreen, and speakers) work in conjunction with the software components.

The software processes the inputs through audio-visual speech recognition, user detection and tracking, and finger-touch processing. (The operating system can be Android or Windows.)
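The hardware-to-software routing described above can be sketched as a simple dispatcher. This is an illustrative sketch only: the class and handler names are assumptions for the example, not the actual VABot AI API.

```python
from dataclasses import dataclass

@dataclass
class InputEvent:
    modality: str    # "voice", "vision", or "touch"
    payload: str     # stands in for raw audio, a camera frame, or touch coordinates

def route_input(event: InputEvent) -> str:
    """Dispatch each hardware event to the software module that handles it.

    The handlers are placeholders for the real speech-recognition,
    user-tracking, and touch-processing modules.
    """
    handlers = {
        "voice": lambda p: f"speech_recognition({p})",
        "vision": lambda p: f"user_tracking({p})",
        "touch": lambda p: f"touch_processing({p})",
    }
    handler = handlers.get(event.modality)
    if handler is None:
        raise ValueError(f"unknown modality: {event.modality}")
    return handler(event.payload)
```

For example, `route_input(InputEvent("voice", "query.wav"))` would hand the audio off to the speech-recognition stage.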

4. Response Generation:

The dialogue model stored within the system's application data determines the VABot’s response to the user's input.

Speech synthesis and facial synthesis are employed to generate a coherent verbal and visual response from the VABot AI.
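The response-generation step can be sketched as a lookup into the dialogue model followed by the two synthesis passes. Everything here is a placeholder under assumed names; the real dialogue model and synthesis engines are not specified at this level of detail.

```python
# Toy dialogue model: maps a recognized intent to reply text.
DIALOGUE_MODEL = {
    "greeting": "Hello! How can I help you today?",
    "ticket": "Sure, I can print your ticket now.",
}

def generate_response(intent: str) -> dict:
    """Look up the reply text, then produce matching audio and face tracks.

    The f-strings stand in for calls to the TTS engine and the
    lip-sync/facial-animation engine.
    """
    text = DIALOGUE_MODEL.get(intent, "Sorry, could you rephrase that?")
    return {
        "text": text,
        "audio": f"speech_synthesis[{text}]",
        "face": f"facial_synthesis[{text}]",
    }
```

The key design point is that one reply text drives both synthesis passes, which keeps the verbal and visual output coherent.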

5. Multi-modal Output:

The terminal response is output through the speakers as synthesized speech.

The "talking head", the Virtual Assistant Avatar (VABot AI), on the terminal's screen provides a visual representation of the dialogue, with facial expressions and lip movements synchronized to the audio output.

6. Dialogue Management:

The VABot AI dialogue manager ensures that the conversation flows logically, handling the user's queries and commands effectively. It manages the synchronization and fusion of different modalities (audio, visual, text) to maintain an engaging interaction.
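A dialogue manager of this kind is commonly built as a small state machine over conversation turns. The sketch below assumes a generic finite-state approach with invented states and replies; the actual VABot AI dialogue manager is not documented at this level.

```python
class DialogueManager:
    """Minimal turn-based dialogue manager sketch."""

    def __init__(self):
        self.state = "idle"
        self.history = []   # (speaker, utterance) pairs for context

    def handle(self, user_utterance: str) -> str:
        """Advance the conversation state and return the reply text.

        The same reply text would then drive all output modalities:
        synthesized speech, facial animation, and on-screen text.
        """
        self.history.append(("user", user_utterance))
        if self.state == "idle":
            self.state = "engaged"
            reply = "Welcome! What are you looking for?"
        elif "ticket" in user_utterance.lower():
            self.state = "completing"
            reply = "Printing your ticket now."
        else:
            reply = "I can help with tickets and general information."
        self.history.append(("vabot", reply))
        return reply
```

Keeping the history inside the manager is what lets the conversation "flow logically" across turns rather than treating each query in isolation.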

7. Service Completion:

The user receives the requested information or service via the AI-powered Virtual Assistant Avatar. The interaction concludes with the user's needs being met, or with guidance on further steps (e.g., printing a ticket, completing a transaction).

Throughout these steps, the terminal utilizes NVIDIA's deep learning and AI capabilities to process and generate the avatar's responses, ensuring a responsive and personalized interaction. This integration of multi-modal inputs and outputs exemplifies how AI can enhance customer service experiences across a wide range of industries.

General Structure of Multi-Voice, Multi-Face Audio-Visual TTS Synthesis

Orthographic Text: The process begins with orthographic text, which is the written form of the language.

Textual Processor: This component processes the orthographic text, considering vocabulary, morphology, and syntax rules to produce prosodically marked orthographic text. Prosody refers to the patterns of stress and intonation in a language.

Phonemic Processor: The prosodically marked text is then transformed into a phonemic representation. This involves converting letters to phonemes (the smallest units of sound) and then to allophones (variant forms of a phoneme).

Prosodic Processor: In parallel with phonemic processing, the prosodic processor uses the marked text to determine the prosodic parameters, such as intonation (F0), timing (T), and loudness (A) values. These parameters are crucial for generating natural-sounding speech.

Allophonic Text with Prosodic Parameters: The phonemic processor and the prosodic processor outputs are combined to create allophonic text that includes prosodic parameters. This text now has detailed instructions on how it should be spoken.

Acoustical Processor: This processor takes the allophonic text and uses a database of acoustic units (AUPs) to produce the speech signal. The AUP database contains rules for mapping allophones to speech sounds.

Visual Processor: Simultaneously, the visual processor uses a database of nonverbal facial movements (NFMs) to animate a face, syncing it with the speech signal. The NFMs database has concatenation rules for creating smooth transitions between facial expressions.

Animated Face and Speech Signal: The outputs of the acoustical and visual processors are combined to produce a synchronized animated face and speech signal. This results in a multi-modal output that mimics human speech both audibly and visually.
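The processor chain above can be summarized as plain function composition. In the sketch below every transformation is a toy stand-in with invented names; only the data flow (text → prosody markup → allophones → combined audio and face output) mirrors the structure described.

```python
def textual_processor(text: str) -> str:
    # Produces prosodically marked orthographic text.
    return f"<prosody>{text}</prosody>"

def phonemic_processor(marked: str) -> str:
    # Letters -> phonemes -> allophones.
    return f"allophones({marked})"

def prosodic_processor(marked: str) -> dict:
    # Determines intonation (F0), timing (T), and loudness (A) values.
    return {"F0": "intonation", "T": "timing", "A": "loudness"}

def acoustical_processor(allophonic: str, prosody: dict) -> str:
    # Maps allophones plus prosody to a speech signal via the AUP database.
    return f"speech_signal[{allophonic}|{sorted(prosody)}]"

def visual_processor(allophonic: str, prosody: dict) -> str:
    # Animates a face from the NFM database, synced to the same input.
    return f"animated_face[{allophonic}]"

def synthesize(text: str) -> tuple:
    """Run the full audio-visual TTS chain and return (speech, face)."""
    marked = textual_processor(text)
    allophonic = phonemic_processor(marked)
    prosody = prosodic_processor(marked)
    return (acoustical_processor(allophonic, prosody),
            visual_processor(allophonic, prosody))
```

Note how the acoustical and visual processors consume the same allophonic-plus-prosody input, which is what makes the final face animation and speech signal synchronizable.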

Key components of VABot AI-powered consumer terminals:

The VABot AI project presents an innovative approach to multi-modal information terminals (also known as kiosks), focusing on audio-visual speech recognition and synthesis.

User Interaction Process: This involves the user selecting an avatar and providing voice input, which is crucial for initiating the interactive experience.

Voice to Text Conversion: This step is essential as it translates user voice input into text, allowing the system to process and understand user requests.

Large Language Model (LLM) Processing: The LLM interprets the user's request and decides on the availability of the requested item or information.

Item Retrieval or Recommendation: Depending on the LLM's decision, the system either shows the item or suggests alternatives.

Text to Speech and Video Generation: This dual process converts the final text content into speech and simultaneously generates a corresponding video with the avatar speaking the output.

User Output Delivery: The system presents an integrated audiovisual representation of the information to the user.

Technical Specifications: It's important to note the system's robust requirements like NVIDIA GPUs, RAM, processing power, storage, and compatibility with various operating systems and networks.
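The component flow listed above, from voice input through LLM processing to audio-visual output, can be sketched end to end. Each function body here is a placeholder under assumed names (the catalog, the transcription rule, and the LLM logic are all invented for the example); a real deployment would call a speech recognizer, an LLM, and a TTS/video engine.

```python
# Toy inventory the "LLM" consults to decide item availability.
CATALOG = {"coffee": "Aisle 3"}

def voice_to_text(audio: str) -> str:
    # Stands in for the speech-recognition module.
    return audio.replace(".wav", "")

def llm_process(query: str) -> str:
    # Stands in for LLM interpretation: show the item or suggest alternatives.
    if query in CATALOG:
        return f"{query} is available at {CATALOG[query]}"
    return f"Sorry, {query} is unavailable; may I suggest an alternative?"

def text_to_speech_and_video(text: str) -> dict:
    # Dual output: synthesized speech plus avatar video of the same text.
    return {"speech": f"tts[{text}]", "video": f"avatar_video[{text}]"}

def serve_user(audio: str) -> dict:
    """Full pipeline: voice input -> text -> LLM decision -> AV output."""
    text = voice_to_text(audio)
    answer = llm_process(text)
    return text_to_speech_and_video(answer)
```

For example, `serve_user("coffee.wav")` would yield matched speech and video tracks announcing where the item can be found.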

User Interaction Process: The graph below illustrates the sequential steps a user goes through when interacting with the system. It starts with avatar selection, progresses to voice input, and ends with UI interaction. This sequence underlines the user-centric design of the VABot AI interactive terminals.

System Technical Specifications: This graph shows the technical specifications required for VABot AI super-terminals. It includes RAM (minimum 32 GB for optimal performance), CPU (Intel i7 or Ryzen 7 series for enhanced performance), GPU (NVIDIA GPUs with CUDA support for efficient parallel processing), storage (at least 512 GB SSD for faster data access), and OS compatibility (supporting various operating systems for broad accessibility).
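The minimum requirements above lend themselves to a simple pre-deployment check. The field names and the candidate-machine dictionary below are assumptions made for this sketch, not part of any VABot AI tooling.

```python
# Minimum specs taken from the list above: 32 GB RAM, 512 GB SSD,
# and a CUDA-capable NVIDIA GPU.
MINIMUM_SPECS = {"ram_gb": 32, "storage_gb": 512, "cuda_gpu": True}

def meets_minimum(machine: dict) -> bool:
    """Return True if a candidate terminal satisfies every minimum spec."""
    return (
        machine.get("ram_gb", 0) >= MINIMUM_SPECS["ram_gb"]
        and machine.get("storage_gb", 0) >= MINIMUM_SPECS["storage_gb"]
        and machine.get("cuda_gpu", False) == MINIMUM_SPECS["cuda_gpu"]
    )
```

A terminal with 64 GB RAM, a 1 TB SSD, and a CUDA GPU would pass; one with 16 GB RAM would not.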

Text to Speech and Video Generation Process: The graph below depicts the process flow from text conversion to speech synthesis and finally to video generation. This demonstrates how the system transforms user input into a multi-modal output, emphasizing the advanced technological integration in VABot AI.
