Brief Introduction to CSLT

  • Brief Introduction to CSLT

  • Speech and Language Processing @ CSLT

  • Database Creation and Standardization Activities





  • Mission:

    • To develop advanced speech and language processing technology to meet the growing demand for human-computer interaction anywhere, anytime, and in any way.
    • To focus on multi-lingual and multi-platform speech recognition, pattern recognition of multi-modal biometric features, and natural language processing.
  • History:

    • Founded in February 2007, with faculty members from research groups including:
      • the Center for Speech Technology (CST, founded in 1979, the second earliest in China) and the State Key Laboratory of Intelligent Technology and Systems (SKLits, ranked A in all three rounds of its evaluation), Department of Computer Science and Technology;
      • the Speech Processing Technology Group (founded in 1986) and the Speech-on-Chip Group, Department of Electronic Engineering;
      • the Future Information Technology (FIT) R&D Center, Research Institute of Information Technology (RIIT); as well as
      • the Division of Computer Science and Artificial Intelligence of the Tsinghua National Laboratory for Information Science and Technology.


  • Keeping in mind “application, innovation, focus, and accumulation”, CSLT directs its research efforts at automatic speech recognition (ASR), voiceprint recognition (VPR), and natural language processing (NLP):

    • 面向应用 -- orient toward applications,
    • 推进创新 -- promote innovation,
    • 突出重点 -- focus on priorities, and
    • 厚积薄发 -- build strength through long accumulation.
  • By exploring an effective operational mode combining “Study-Research-Product (产学研)”, CSLT aims to develop technology and applications with its own IPR, and to push forward applied basic research and technology innovation.



Organization Chart

  • 6 research groups:

    • Speech Recognition,
    • Speaker Recognition,
    • Speech-on-Chip,
    • Intelligent Searching,
    • Language Understanding, and
    • Resource and Standardization.
  • 1 joint lab + 1 joint institute



  • Advisory Board:

    • Victor Zue (MIT, IEEE Fellow, NAE member)
    • B.-H. (Fred) Juang (Georgia Tech, IEEE Fellow, NAE member)
    • William Byrne (Cambridge)
    • Dan Jurafsky (Stanford)
    • Richard Stern (CMU)
    • FANG Ditang (Tsinghua)
    • WU Wenhu (Tsinghua)
    • LIU Runsheng (Tsinghua)
  • Directors:

    • Director: Prof. Thomas Fang Zheng
    • Deputy Director: Assoc. Prof. LIU Yi (Executive)
    • Deputy Director: Assoc. Prof. XIAO Xi (R&D)
    • Deputy Director: Assoc. Prof. XU Mingxing (Students)


Faculty Members and Others

  • Speech Processing (ASR&VPR):

    • Associate Professor: LIU Yi
    • Associate Professor: XIAO Xi
    • Associate Professor: XU Mingxing
    • Assistant Professor: LIANG Weiqian
    • Assistant Professor: OU Zhijian
  • Natural Language Processing (NLP):

    • Associate Professor: SUN Jiasong
    • Associate Professor: ZHOU Qiang
    • Assistant Professor: WU Xiaojun
    • Assistant Professor: XIA Yunqing
  • 2 Research Associates + 2 Postdoctoral Researchers

  • 3 PhD Students + 13 Master Students





  • I. Automatic Speech Recognition (ASR)



  • Large vocabulary Chinese speech recognition (Chinese dictation machine)

  • Voice command, and embedded speech recognition on chip

  • Keyword spotting with confidence measures and semantic templates

  • Spontaneous speech recognition (starting from the JHU Summer Workshop 2000)

  • Dialectal Chinese speech recognition (starting from the JHU Summer Workshop 2004) -- in this talk



Chinese ASR encounters an issue bigger than in any other language: dialect.

  • There are 8 major dialectal regions in addition to Mandarin (Northern China), including:

    • Wu (southern Jiangsu, Zhejiang, and Shanghai);
    • Yue (Guangdong, Hong Kong, and Nanning, Guangxi);
    • Min (Fujian, Shantou in Guangdong, Haikou in Hainan, and Taipei, Taiwan);
    • Hakka (Meixian in Guangdong and Hsin-chu, Taiwan);
    • Gan (Jiangxi);
    • Xiang (Hunan);
    • Hui (Anhui); and
    • Jin (Shanxi and Hohhot, Inner Mongolia).
  • These can be further divided into over 40 sub-categories.





  • Chinese dialects share the same written language:

    • the same Chinese pinyin set (canonically),
    • the same Chinese character set (canonically), and
    • the same vocabulary (canonically).
  • Standard Chinese (known as Putonghua, or PTH) is widely spoken in most regions across China.

  • However, speech is strongly influenced by native dialects: most Chinese people speak both standard Chinese and their own dialect, resulting in dialectal Chinese -- Putonghua influenced by the native dialect.

  • In dialectal Chinese:

    • word usage, pronunciation, and grammar vary depending on the speaker's dialect;
    • ASR relies to a great extent on consistent pronunciation and word usage within a language;
    • ASR systems built to process PTH therefore perform poorly for the great majority of the population.


  • To develop a general framework to model, in dialectal Chinese ASR tasks:

    • phonetic variability,
    • lexical variability, and
    • pronunciation variability.
  • To find suitable methods to modify the baseline PTH recognizer into a dialectal Chinese recognizer for the specific dialect of interest, employing:

    • dialect-related knowledge (syllable mapping, cross-dialect synonyms, …), and
    • training/adaptation data (in relatively small quantities).
  • Expectation: the resulting recognizer should also work for PTH; in other words, it should handle a mixture of PTH and dialectal Chinese.

  • This proposal was selected as one of three projects for the 2003 Johns Hopkins University Summer Workshop from dozens of proposals collected from universities and companies around the world, and was postponed to 2004 due to SARS.





Chinese Syllable Mapping (CSM)

  • This CSM is dialect-related.
  • Two types (see the sketch below):

    • Word-independent CSM: e.g., in Southern Chinese, initial mappings include zh→z, ch→c, sh→s, n→l, and so on, and final mappings include eng→en and ing→in;
    • Word-dependent CSM: e.g., in dialectal Chuan Chinese, the pinyin 'guo2' is changed into 'gui0' in the word '中国 (China)', while only the tone is changed in the word '过去 (past)'.
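To make the two mapping types concrete, here is a minimal Python sketch of how a CSM table might be consulted for one syllable. The mapping pairs come from the slide above; the data structures, the function name, and the target form for '过去' (assumed to take the neutral tone) are illustrative assumptions.

```python
# Word-independent mappings: apply to any syllable, regardless of the word.
INITIAL_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}
FINAL_MAP = {"eng": "en", "ing": "in"}

# Word-dependent mappings: keyed by (word, canonical pinyin) pairs.
WORD_DEPENDENT_MAP = {
    ("中国", "guo2"): "gui0",  # from the slide: vowel and tone both change
    ("过去", "guo4"): "guo0",  # hypothetical: only the tone changes
}

def map_syllable(word: str, initial: str, final: str, tone: str) -> str:
    """Return the dialectal surface form of one pinyin syllable."""
    canonical = f"{initial}{final}{tone}"
    # Word-dependent rules take priority over word-independent ones.
    if (word, canonical) in WORD_DEPENDENT_MAP:
        return WORD_DEPENDENT_MAP[(word, canonical)]
    return INITIAL_MAP.get(initial, initial) + FINAL_MAP.get(final, final) + tone

print(map_syllable("上海", "sh", "ang", "4"))  # -> 'sang4' (word-independent)
print(map_syllable("中国", "g", "uo", "2"))    # -> 'gui0'  (word-dependent)
```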


Lexicon

  • Linguists estimate the vocabulary similarity between PTH and the Wu dialect at about 60-70%.
  • A dialect-related lexicon contains two parts (see the sketch below):

    • a common part shared by standard Chinese and most dialectal Chinese variants (over 50k words), and
    • a dialect-related part (several hundred words).
  • In this lexicon:

    • each word has one pinyin string for its standard Chinese pronunciation and a representation of its dialectal Chinese pronunciation, and
    • each dialect-related word corresponds to a word in the common part with the same meaning.
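A minimal sketch of how such a two-part lexicon might be represented. Only the structure comes from the slide (a common part plus a dialect-related part, one pinyin string for the standard pronunciation plus a representation of the dialectal pronunciation per word, and a pointer from each dialect word to its common-part synonym); the field names and sample entries are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LexEntry:
    word: str                   # Chinese character string
    pth_pinyin: str             # standard (PTH) pronunciation
    dialect_pron: str           # representation of the dialectal pronunciation
    common_synonym: Optional[str] = None  # set only for dialect-part entries

# Common part: shared by standard Chinese and most dialectal variants (>50k).
common_part = {"什么": LexEntry("什么", "shen2 me0", "shen2 me0")}

# Dialect-related part (several hundred): each entry points to a common-part
# word with the same meaning (a hypothetical Wu example).
dialect_part = {"啥": LexEntry("啥", "sha2", "sa2", common_synonym="什么")}

def normalize(word: str) -> str:
    """Map a dialect-specific word to its common-part synonym, if any."""
    entry = dialect_part.get(word)
    return entry.common_synonym if entry else word
```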


Language

  • Though it is difficult to collect dialect texts, dialect-related lexical entry replacement rules can be learned in advance; therefore,
  • language post-processing or language model adaptation techniques can be adopted (see the sketch below).
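One way to read this: dialect text is scarce, but abundant PTH text can be rewritten with the learned replacement rules into pseudo-dialect text and used for language model adaptation. A minimal sketch under that assumption, with hypothetical rules:

```python
# Hypothetical Wu-style replacement rules learned in advance: PTH -> dialect.
REPLACE_RULES = {"什么": "啥", "很": "老"}

def to_pseudo_dialect(pth_words):
    """Rewrite a word-segmented PTH sentence with dialect lexical entries."""
    return [REPLACE_RULES.get(w, w) for w in pth_words]

# The rewritten corpus can then be mixed with the original PTH corpus
# (e.g., via n-gram count interpolation) to adapt the language model.
print(to_pseudo_dialect(["你", "在", "做", "什么"]))  # -> ['你', '在', '做', '啥']
```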








  • 11 hours in total -- half read (R) + half spontaneous (S):

    • 100 Shanghai speakers × (3R + 3S) minutes / speaker
    • 10 Beijing speakers × 6S minutes / speaker
    • (100 × 6 min + 10 × 6 min = 660 min = 11 hours)
  • Read speech with well-balanced prompting sentences:

    • Type I: each sentence contains PTH words only (5-6k)
    • Type II: each sentence contains one or two of the most commonly used Wu dialectal words, while the rest are PTH words
  • Spontaneous speech on pre-defined topics:

    • Conversations with a PTH speaker on a self-selected topic from: sports, politics/economy, entertainment, lifestyle, technology
  • Balanced speakers (gender, age, education, PTH level, …)















  • At the acoustic level, approaches include (see the interpolation sketch below):

    • Retraining the AM on standard speech plus a certain amount of dialectal speech
    • Interpolation between standard-speech HMMs and their corresponding dialectal-speech HMMs
    • Combination of the AM with state-level pronunciation modeling
    • Adaptation of the standard-speech AM with a certain amount of dialectal speech
  • Existing problems:

    • A large amount of dialectal speech is needed to build dialect-specific acoustic models
    • A single acoustic model cannot perform well on both standard and dialectal speech
    • Some acoustic modeling methods are too complicated to deploy readily
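To make the second approach in the list concrete, here is a minimal sketch of linear interpolation between a standard-speech HMM state and its dialectal counterpart, applied to mixture weights only; a real system would interpolate full GMM parameters, and all names and values here are illustrative assumptions.

```python
import numpy as np

def interpolate_weights(w_std: np.ndarray, w_dia: np.ndarray,
                        lam: float) -> np.ndarray:
    """Interpolate two mixture-weight vectors; lam weights the standard model."""
    w = lam * w_std + (1.0 - lam) * w_dia
    return w / w.sum()  # re-normalize to guard against rounding drift

w_std = np.array([0.5, 0.3, 0.2])  # weights from the PTH-trained state
w_dia = np.array([0.2, 0.4, 0.4])  # weights from the dialect-trained state
print(interpolate_weights(w_std, w_dia, lam=0.7))
```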


What we proposed:

  • Taking into consideration simultaneously a precise context-dependent HMM from standard speech and its corresponding, less precise context-independent HMM from dialectal speech
  • Merging HMMs on a state-level basis according to certain criteria (see the sketch below)
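A minimal sketch of the state-level merging idea: the Gaussians of the context-dependent PTH state and its context-independent dialectal counterpart are pooled into one larger mixture, with each side's weights rescaled by a merging factor. The factor and the data layout are illustrative assumptions, not CSLT's exact criterion.

```python
import numpy as np

def merge_states(gauss_std, gauss_dia, alpha=0.5):
    """Pool two lists of (weight, mean, var) Gaussians into one state."""
    merged = [(alpha * w, m, v) for (w, m, v) in gauss_std]
    merged += [((1.0 - alpha) * w, m, v) for (w, m, v) in gauss_dia]
    total = sum(w for w, _, _ in merged)           # re-normalize the weights
    return [(w / total, m, v) for (w, m, v) in merged]

std = [(0.6, np.zeros(3), np.ones(3)), (0.4, np.ones(3), np.ones(3))]
dia = [(1.0, 2 * np.ones(3), np.ones(3))]
print(len(merge_states(std, dia)))  # 3 Gaussians: the mixture has grown
```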




  • The disadvantage seen so far:

    • The number of Gaussian mixtures in the merged state is expanded
  • Is it possible to downsize the scale?

    • A straightforward criterion is a distance measure (see the sketch below)
    • The larger the distance, the greater the acoustic coverage:
      • merging, if distance(d, s) ≥ threshold
      • no merging, if distance(d, s) < threshold
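A minimal sketch of that decision rule. The slides do not name the distance measure; the symmetrized KL divergence between diagonal-covariance Gaussians used here is one common choice and is an assumption, as is the threshold value.

```python
import numpy as np

def sym_kl_diag(m1, v1, m2, v2):
    """Symmetrized KL divergence between diagonal-covariance Gaussians."""
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
    return 0.5 * (kl12 + kl21)

def should_merge(m_d, v_d, m_s, v_s, threshold=2.0):
    # merge only if the dialectal Gaussian (d) is far enough from the
    # standard one (s) to add acoustic coverage: distance(d, s) >= threshold
    return sym_kl_diag(m_d, v_d, m_s, v_s) >= threshold

print(should_merge(np.zeros(3), np.ones(3), 2 * np.ones(3), np.ones(3)))
```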










  • II. Voiceprint Recognition (VPR)



  • Cross-channel (channel mismatch) -- in this talk,

  • Multi-speaker (such as in telephone conversations),

  • Text- and language-independent recognition,

  • Very short speech segments (such as verification in monitoring for public security),

  • Background noise, and …



Cross-channel (1) -- IEEE Trans. ASLP ’07

  • A cohort-based speaker model synthesis (SMS) algorithm, designed to synthesize robust speaker models without requiring channel-specific enrollment data.

  • Assumption: if two speakers' voices are similar in one channel, their voices will also be similar in another channel.

  • However, exceptions always exist.

  • We therefore propose to use a cohort set of speaker models, instead of a single speaker model, to perform the SMS (see the sketch below).
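A minimal sketch of the cohort idea, assuming speaker models are represented as GMM mean supervectors: rank cohort speakers by similarity to the target in the enrollment channel, then synthesize the target's model in the other channel by averaging the nearest cohort speakers' models there. The representation and the Euclidean similarity are illustrative assumptions.

```python
import numpy as np

def synthesize_model(target_a, cohort_a, cohort_b, k=5):
    """target_a: (D,) target speaker's supervector in channel A.
    cohort_a, cohort_b: (N, D) cohort supervectors in channels A and B."""
    # Rank cohort speakers by similarity to the target in channel A ...
    nearest = np.argsort(np.linalg.norm(cohort_a - target_a, axis=1))[:k]
    # ... and average their channel-B models rather than copying just one,
    # which softens the "similar in A implies similar in B" assumption.
    return cohort_b[nearest].mean(axis=0)

rng = np.random.default_rng(0)
cohort_a, cohort_b = rng.standard_normal((2, 50, 128))
model_b = synthesize_model(rng.standard_normal(128), cohort_a, cohort_b)
```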



Cross-channel (2) -- IEEE ICASSP ’07

  • We propose a new method based on:

    • the idea of projection in nuisance attribute projection (NAP), designed for GMM-SVM systems, and
    • the idea of model compensation in factor analysis,
    • called session variability subspace projection (SVSP). The idea is to use the session variability in a test utterance to compensate speaker models whose session variability has been removed during training.
  • SVSP consists of four modules (see the sketch below):

    • estimation of the session variability subspace (during training);
    • speaker model training by adaptation from the UBM, with session variability removed;
    • speaker model compensation with the test utterance (during recognition); and
    • test utterance verification.
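A minimal sketch of the SVSP flow, assuming speaker models are GMM mean supervectors and that removing session variability amounts to projecting out a low-rank subspace U; the shapes, names, and exact projection form are illustrative assumptions, not the published formulation.

```python
import numpy as np

def remove_session(sv: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Project out the session subspace: sv - U (U^T U)^{-1} U^T sv."""
    coeffs = np.linalg.solve(U.T @ U, U.T @ sv)
    return sv - U @ coeffs

def compensate(speaker_sv: np.ndarray, test_sv: np.ndarray,
               U: np.ndarray) -> np.ndarray:
    """Add the test utterance's session component back onto the clean model."""
    session_part = test_sv - remove_session(test_sv, U)
    return speaker_sv + session_part

rng = np.random.default_rng(0)
U = rng.standard_normal((100, 10))                   # session subspace basis
model = remove_session(rng.standard_normal(100), U)  # training-time cleanup
scored = compensate(model, rng.standard_normal(100), U)  # recognition-time
```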






NIST SRE ’06 post-evaluation results

  • On the left:

    • DET curve of the best result (STBU3) in the 1c4w-1c4w condition
    • STBU = Spescom DataVoice, South Africa + TNO, Netherlands + Brno Univ. of Tech., Czech Republic + Univ. Stellenbosch, South Africa
    • Fusion of 11 systems
  • On the right:

    • DET curve of our system in the 1c4w-1c4w condition
    • Fusion of 2 systems (GMM-UBM and SVM)


Applications of VPR

  • Identification for network security (w/ Ministry of Public Security)

  • Verification for user authentication (w/ Ministry of Public Security and People’s Armed Police College)

  • Verification for user authentication (w/ China Mobile, Commercial Bank of Baotou, ...)



Demos -- Speaker recognition application in passport control







  • III. Natural Language Processing (NLP)



Current Focuses

  • Key concept based parsing,

  • Dialogue management,

  • SDS Studio -- in this talk,

  • Vertical search engine -- in this talk, and

  • Opinion mining, summarization, text categorization, …



d-Ear SDS Studio



Additional Tools





Semantic Parsing



Fuzzy Match



Result Ranking



Map Semantics



  • Collaboration model:

    • Technologies originate from Tsinghua
    • Toolkits, products, and services are developed by d-Ear
  • Current services provided:

    • House renting, job hunting, train and ticket information, digital products, singers and songs, greetings, …
  • Multi-modal interfaces:

    • Webpage, WAP, SMS, and, in the future, telephone (ASR), …




  • CST is one of the 8 co-founders of the Chinese Corpus Consortium (CCC, founded in March 2004, http://www.CCCForum.org).

  • CCC is a non-profit, academic consortium formed voluntarily by international companies and research institutes interested in the construction and application of Chinese speech and linguistic corpus resources.

  • The purposes of CCC include:

    • Collecting and integrating existing Chinese speech and linguistic corpus resources, and continuing to create new ones.
    • Integrating existing tools for the creation, transcription, and analysis of Chinese speech and linguistic corpus resources, improving their usability, and creating new tools.
    • Collecting, organizing, and introducing specifications and standards for Chinese speech and language research and development.
    • Promoting the exchange of Chinese speech and linguistic corpus resources.






  • CSLT actively participates in standardization activities.

  • On Nov. 18, 2003, the CSITSG was officially established with the approval of the Ministry of Information Industry (MII), including 5 special topic groups:

    • Speech recognition,
    • Speaker recognition (voiceprint recognition),
    • Speech synthesis,
    • Speech assessment, and
    • Databases and transcription.


  • On Apr. 24, 2004, the MII approved plans for drafting standards for speech recognition, speaker recognition, and speech synthesis; CSLT was put in charge of the speaker recognition standard.

  • The speaker recognition standard includes three parts:

    • Terms and definitions;
    • Data exchange formats and data structures; and
    • API functions.
  • In Dec. 2006, the draft of the “automatic speaker recognition (voiceprint recognition) standards” was approved by the expert committee, and it will be announced this year.



  • Subcommittee 2 (SC2) on Human Biometrics Application, under Technical Committee 100 (TC100) on Security Protection Alarm Systems of the Standardization Administration of China (SAC), was formed in 2007.

  • Application oriented:

    • Public security
  • Multi-modal:

    • Fingerprint,
    • face,
    • voiceprint,
    • iris, and
    • palm.
  • CSLT -- Vice Chair of SAC/TC100/SC2




