Figure 9.2 Four emerging computing platforms and associated HCI technologies to pay attention to in the next 10 years: high-quality cloud service and ubiquitous and mobile interaction clients, experiential and natural user interfaces.


sense out of the sentence, which is composed of a sequence of recognized words (a task usually known as natural language understanding). Word recognition (whether of spoken, written, or printed words) is, of course, a prerequisite to sentence understanding. (Here we focus only on spoken-word, or voice, recognition.) Voice-recognition performance and practicality depend on the target number of words to be recognized, the number of speakers, the noise level in the usage environment, and the need for any special devices (e.g., a noise-canceling microphone). The current state of the art seems to be (a) over a 95% recognition rate (for individual words), (b) for at least millions of words and more than 30 languages, (c) in real time (through the high-performance cloud), (d) without speaker-specific training (by age, gender, or dialect), (e) in a moderately noisy environment (e.g., an office with ambient noise of around 30–40 dB), and (f) with the words spoken relatively close to cheap noise-canceling microphones or software [2]. Such a state of the art would seem sufficient for a much more widespread presence of voice recognition in our daily lives, but this has not happened except in special situations such as disability support or operating constraints in which both hands are occupied. One main reason seems to be that users are intolerant of even the 2%–3% of incorrect recognitions, although humans themselves do not possess 100% word-recognition capability. Another reason has to do with the segmentation problem. Voice recognition often requires a mode during which the input is given explicitly, because otherwise it is quite difficult to separate and segregate the actual voice input from everything else (noise, normal conversation) within the stream of sound. Entering this mode typically involves a simple additional action, such as a button push/release, but users find this a significant nuisance.
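To make this mode-based interaction concrete, below is a minimal push-to-talk sketch in Python. It uses the open-source SpeechRecognition package and a cloud recognizer purely as an illustration (neither is tied to any system discussed in this chapter), and the Enter-key prompt stands in for whatever explicit mode-entry action a real product would use, such as a button push/release.

```python
# Minimal push-to-talk sketch: an explicit action marks where the voice
# input starts, side-stepping the segmentation problem at the cost of
# the extra user action discussed above. Requires the SpeechRecognition
# and PyAudio packages; recognize_google() sends the audio to a cloud
# recognizer, playing the role of the high-performance cloud service.
import speech_recognition as sr

recognizer = sr.Recognizer()

def push_to_talk_command():
    input("Press Enter, then speak your command: ")  # explicit mode entry
    with sr.Microphone() as source:
        # Calibrate against ambient noise (e.g., a 30-40 dB office).
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        audio = recognizer.listen(source, phrase_time_limit=5)  # mode exit
    try:
        return recognizer.recognize_google(audio)  # cloud-side recognition
    except sr.UnknownValueError:
        return None  # the 2%-3% failure case users find so frustrating

if __name__ == "__main__":
    print(push_to_talk_command())
```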
One way to overcome this problem is to rely more on multimodality. To eliminate the segmentation problem, the voice input can be accompanied by certain other modal actions, such as a gesture/posture or lip movements within a given context, so that it can be distinguished from noise, other people's speech, or unrelated conversation. We will discuss this multimodal integration in Section 9.1.4.
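As a schematic sketch of such gating, the fragment below accepts audio as intentional input only while a visual cue agrees. The Frame fields are hypothetical stand-ins for real lip-movement and head-pose detectors, not any actual API.

```python
# Schematic multimodal gating for the segmentation problem: the audio
# stream counts as a command only while a second modality indicates the
# user is addressing the device. Both cues here are simulated booleans;
# in a real system they would come from computer-vision components.
from dataclasses import dataclass

@dataclass
class Frame:
    lips_moving: bool     # hypothetical lip-movement detector output
    facing_device: bool   # hypothetical head-pose/gaze estimator output

def accept_audio(frame: Frame) -> bool:
    # Voice is segmented as input only when both cues co-occur, filtering
    # out background noise and other people's conversation.
    return frame.lips_moving and frame.facing_device

# Simulated stream: only the middle frame should open the voice channel.
for frame in [Frame(False, True), Frame(True, True), Frame(True, False)]:
    print(accept_audio(frame))
```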
While isolated word recognition is approaching a nearly 100% accuracy rate, understanding a whole sentence requires the individual words to be recognized from a continuous stream of words. A simple calculation shows that recognizing a sentence of five words, each with a recognition rate of 90%, yields a success rate of only 0.9⁵ ≈ 0.59. Add the problem of extracting the meaning of the whole sentence, and the rate of correct natural language understanding drops even lower.
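The compounding of per-word errors is easy to check directly; the snippet below also shows that even a 95% per-word rate loses about a quarter of five-word sentences. (Both figures assume word errors are independent, a simplification; real recognizers exploit language-model context across words.)

```python
# Success rate of recognizing an n-word sentence when each word is
# recognized independently with probability p.
p, n = 0.90, 5
print(p ** n)      # 0.59049 -> only ~59% of five-word sentences survive
print(0.95 ** n)   # ~0.774 -> even 95% per word loses ~23% of sentences
```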
Despite these difficulties, great efforts are continually being made to improve the situation because of the huge potential. The recent cases of Apple® Siri [3] and IBM® Watson [1] illustrate the bright future of voice/language understanding. Apple Siri recognizes continuously spoken words and understands them with higher accuracy by incorporating contextual knowledge of mobile device usage. IBM Watson showcased very fast understanding of questions asked in natural language in its bout with the human champion (however, the questions were asked in text, not in voice). While the computer used in the quiz contest was a near-supercomputer-level server, IBM is developing a more compact and lighter version specialized to specific and practical domains such as medical expert systems and IPTV (Internet Protocol television) interaction [4]. AT&T provides a similar voice/language-understanding architecture for mobile phone usage, as shown in Figure 9.3.
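The client side of such a server-based architecture can be sketched in a few lines. The endpoint URL and response fields below are hypothetical, for illustration only; they are not AT&T's or IBM's actual interfaces.

```python
# Sketch of the mobile-client side of a Figure 9.3-style architecture:
# capture audio, ship it to a recognition server that does the heavy
# computation, and receive the recognized text plus extracted meaning.
import requests

SPEECH_SERVER = "https://speech.example.com/recognize"  # hypothetical URL

def recognize_on_server(audio_bytes: bytes) -> dict:
    # Steps 1-2 of the figure: the app captures audio and sends it off.
    response = requests.post(
        SPEECH_SERVER,
        data=audio_bytes,
        headers={"Content-Type": "audio/wav"},
        timeout=10,
    )
    response.raise_for_status()
    # Steps 3-4: the server computes the recognition result, extracts the
    # meaning and associated data, and sends them back, e.g.
    # {"text": "...", "intent": "...", "slots": {...}}.
    return response.json()
```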
9.1.2 Gestures
Gestures play a very important role in human communication, in many cases without our awareness. Gestures alone can convey meaning, or they can play a supplemental role alongside other modes of communication. Consequently, incorporating gestures into human–computer interaction is a natural objective. While there may be many
Figure 9.3 Voice/language understanding on a mobile device: the user speaks to the mobile app; the app captures the audio input and sends it to the WATSON server; the server carries out the recognition computation and extracts the meaning and associated data; the server sends back the recognition results.
