Human Computer Interaction Fundamentals
Figure 9.2 Four emerging computing platforms and associated HCI technologies to pay attention to in the next 10 years: high-quality cloud services and ubiquitous and mobile interaction clients, experiential and natural user interfaces.

Making sense out of the sentence, which is composed of a sequence of recognized words, is usually known as natural language understanding. Word recognition (whether the words are spoken, written, or printed) is the prerequisite to sentence understanding; here we focus only on spoken-word, or voice, recognition. Voice-recognition performance and practicality depend on the target number of words to be recognized, the number of speakers, the level of noise in the usage environment, and the need for any special devices (e.g., a noise-canceling microphone). The current state of the art is roughly (a) an over-95% recognition rate for individual words, (b) for vocabularies of at least millions of words and more than 30 languages, (c) in real time (through the high-performance cloud), (d) without speaker-specific training (by age, gender, or dialect), (e) in a midlevel noisy environment (e.g., an office with ambient noise of around 30–40 dB), and (f) with the words spoken relatively close to a cheap noise-canceling microphone or its software equivalent [2].

Such a state of the art would seem quite sufficient for a more widespread presence of voice recognition in our lives, yet this has not happened, except in special situations of disability support or operating constraints in which both hands are occupied. One main reason seems to be that users are not tolerant of even a 2%–3% rate of incorrect recognition, although humans themselves do not possess 100% word-recognition capability. Another reason has to do with the segmentation problem. Voice recognition often requires a mode during which the input is given explicitly, because otherwise it is quite difficult to separate the actual voice input from everything else (noise, normal conversation) in the stream of sound. Entering this mode typically involves a simple additional action, such as a button push/release, but users find this a significant nuisance.

One way to overcome this problem is to rely more on multimodality. To eliminate the segmentation problem, the voice input can be accompanied by other modal actions, such as a gesture/posture or lip movements within a given context, so that it is distinguished from noise, other people's speech, or unrelated conversation. We will discuss this multimodal integration in Section 9.1.4.

While isolated word recognition is approaching a nearly 100% accuracy rate, understanding a whole sentence requires recognizing the individual words from a continuous stream of speech. By a simple calculation, recognizing a sentence of five words, each with a recognition rate of 90%, yields a success rate of only 0.9^5 ≈ 0.59. Add the problem of extracting the meaning of the whole sentence, and the rate of correct natural language understanding is lower still.
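To make the compounding effect concrete, the same calculation can be run for several sentence lengths. This is a back-of-the-envelope sketch: the 90% per-word rate is the figure used above, and treating word errors as independent is a simplifying assumption.

```python
# Sentence-level success rate under independent per-word recognition errors.
# The 90% per-word accuracy is the figure from the text; real recognizers
# have correlated errors, so this is only a rough model.
word_accuracy = 0.90

for n_words in (1, 3, 5, 10, 20):
    sentence_rate = word_accuracy ** n_words
    print(f"{n_words:2d} words -> {sentence_rate:.2f}")
```

A five-word sentence comes out at 0.59, matching the calculation above, and by 20 words the chance of a fully correct sentence falls to about 0.12; this steep drop is one reason sentence-level understanding lags so far behind isolated-word recognition.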
Despite these difficulties, and because of its huge potential, great efforts are continually being made to improve the situation. The recent cases of Apple® Siri [3] and IBM® Watson [1] illustrate the bright future of voice/language understanding. Apple Siri understands continuously spoken words, and it does so with higher accuracy by incorporating the contextual knowledge of mobile-device usage. IBM Watson showcased very fast understanding of questions asked in natural language in its bout with the human champion (however, the questions were asked in text, not in voice). While the computer used in the quiz contest was a near-supercomputer-level server, IBM is developing more compact and lighter versions specialized to specific and practical domains such as medical expert systems and IPTV (Internet protocol television) interaction [4]. AT&T provides a similar voice/language-understanding architecture for mobile-phone usage, as shown in Figure 9.3.

Figure 9.3 Voice/language understanding for mobile-phone usage: the user speaks to the mobile app; the app captures the audio input and sends it to the WATSON server; the server carries out the recognition computation, extracts the meaning and associated data, and sends the recognition results back to the app.
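The round trip in Figure 9.3 can be sketched from the client's side. This is a hedged illustration rather than AT&T's actual API: the endpoint URL, the audio format, and the response fields (transcript, intent) are hypothetical stand-ins for whatever a real recognition service defines.

```python
# Minimal sketch of the client half of the Figure 9.3 architecture:
# capture audio -> upload it -> let the server recognize and extract meaning.
# The URL and the response fields below are hypothetical placeholders.
import json
import urllib.request

RECOGNIZER_URL = "https://speech.example.com/v1/recognize"  # hypothetical endpoint

def recognize(audio_bytes: bytes) -> dict:
    """Send one captured utterance to the recognition server, return its reply."""
    request = urllib.request.Request(
        RECOGNIZER_URL,
        data=audio_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    # All heavy lifting (word recognition, meaning extraction) happens on the
    # server; the thin mobile client only ships audio and reads back results.
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    with open("utterance.wav", "rb") as f:  # audio previously captured by the app
        reply = recognize(f.read())
    print(reply.get("transcript"), reply.get("intent"))
```

Keeping the client this thin matches the trend noted with Figure 9.2: the recognition quality comes from the high-performance cloud, while the mobile side remains a lightweight interaction client.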
9.1.2 Gestures

Gestures play a very important role in human communication, in many cases without our being aware of it. Gestures alone can convey meaning, or they can play a supplemental role in other modes of communication. Consequently, incorporating gestures into human-computer interaction is a natural objective. While there may be many