Human Computer Interaction Fundamentals
Figure 9.2 Four emerging computing platforms and associated HCI technologies to pay attention to in the next 10 years: high-quality cloud services and ubiquitous and mobile interaction clients, experiential and natural user interfaces.

Making sense out of the sentence, which is composed of a sequence of recognized words, is usually known as natural language understanding. Word recognition (whether the words are spoken, written, or printed) is the prerequisite to sentence understanding; here we focus only on spoken-word, or voice, recognition. Voice-recognition performance and practicality depend on the target number of words to be recognized, the number of speakers, the level of noise in the usage environment, and the need for any special devices (e.g., a noise-canceling microphone). The current state of the art is roughly (a) an over-95% recognition rate for individual words, (b) for vocabularies of at least millions of words and more than 30 languages, (c) in real time (through the high-performance cloud), (d) without speaker-specific training (by age, gender, or dialect), (e) in a midlevel noisy environment (e.g., an office with ambient noise of around 30–40 dB), and (f) with the words spoken relatively close to a cheap noise-canceling microphone or its software equivalent [2].

Such a state of the art would seem quite sufficient for a more widespread presence of voice recognition in our lives, yet this has not happened, except in special situations of disability support or operating constraints in which both hands are occupied. One main reason seems to be that users are not tolerant of even a 2%–3% rate of incorrect recognition, although humans themselves do not possess 100% word-recognition capability. Another reason has to do with the segmentation problem. Voice recognition often requires a mode during which the input is given explicitly, because otherwise it is quite difficult to separate the actual voice input from everything else (noise, normal conversation) in the stream of sound. Entering this mode typically involves a simple additional action, such as a button push/release, but users find this a significant nuisance.

One way to overcome this problem is to rely more on multimodality. To eliminate the segmentation problem, the voice input can be accompanied by other modal actions, such as a gesture/posture or lip movements within a given context, so that it is distinguished from noise, other people's speech, or unrelated conversation. We will discuss this multimodal integration in Section 9.1.4.

While isolated word recognition is approaching a nearly 100% accuracy rate, understanding a whole sentence requires recognizing the individual words from a continuous stream of speech. By a simple calculation, recognizing a sentence of five words, each with a recognition rate of 90%, yields a success rate of only 0.9^5 ≈ 0.59. Add the problem of extracting the meaning of the whole sentence, and the rate of correct natural language understanding is lower still.
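To make the compounding effect concrete, the same calculation can be run for several sentence lengths. This is a back-of-the-envelope sketch: the 90% per-word rate is the figure used above, and treating word errors as independent is a simplifying assumption.

```python
# Sentence-level success rate under independent per-word recognition errors.
# The 90% per-word accuracy is the figure from the text; real recognizers
# have correlated errors, so this is only a rough model.
word_accuracy = 0.90

for n_words in (1, 3, 5, 10, 20):
    sentence_rate = word_accuracy ** n_words
    print(f"{n_words:2d} words -> {sentence_rate:.2f}")
```

A five-word sentence comes out at 0.59, matching the calculation above, and by 20 words the chance of a fully correct sentence falls to about 0.12; this steep drop is one reason sentence-level understanding lags so far behind isolated-word recognition.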
Despite these difficulties, and because of its huge potential, great efforts are continually being made to improve the situation. The recent cases of Apple® Siri [3] and IBM® Watson [1] illustrate the bright future of voice/language understanding. Apple Siri understands continuously spoken words, and it does so with higher accuracy by incorporating the contextual knowledge of mobile-device usage. IBM Watson showcased very fast understanding of questions asked in natural language in its bout with the human champion (however, the questions were asked in text, not in voice). While the computer used in the quiz contest was a near-supercomputer-level server, IBM is developing more compact and lighter versions specialized to specific and practical domains such as medical expert systems and IPTV (Internet protocol television) interaction [4]. AT&T provides a similar voice/language-understanding architecture for mobile-phone usage, as shown in Figure 9.3.

Figure 9.3 Voice/language understanding for mobile-phone usage: the user speaks to the mobile app; the app captures the audio input and sends it to the WATSON server; the server carries out the recognition computation, extracts the meaning and associated data, and sends the recognition results back to the app.
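The round trip in Figure 9.3 can be sketched from the client's side. This is a hedged illustration rather than AT&T's actual API: the endpoint URL, the audio format, and the response fields (transcript, intent) are hypothetical stand-ins for whatever a real recognition service defines.

```python
# Minimal sketch of the client half of the Figure 9.3 architecture:
# capture audio -> upload it -> let the server recognize and extract meaning.
# The URL and the response fields below are hypothetical placeholders.
import json
import urllib.request

RECOGNIZER_URL = "https://speech.example.com/v1/recognize"  # hypothetical endpoint

def recognize(audio_bytes: bytes) -> dict:
    """Send one captured utterance to the recognition server, return its reply."""
    request = urllib.request.Request(
        RECOGNIZER_URL,
        data=audio_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    # All heavy lifting (word recognition, meaning extraction) happens on the
    # server; the thin mobile client only ships audio and reads back results.
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    with open("utterance.wav", "rb") as f:  # audio previously captured by the app
        reply = recognize(f.read())
    print(reply.get("transcript"), reply.get("intent"))
```

Keeping the client this thin matches the trend noted with Figure 9.2: the recognition quality comes from the high-performance cloud, while the mobile side remains a lightweight interaction client.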
9.1.2 Gestures

Gestures play a very important role in human communication, in many cases without our being aware of it. Gestures alone can convey meaning, or they can play a supplemental role in other modes of communication. Consequently, incorporating gestures into human-computer interaction is a natural objective. While there may be many