Speech Detection, Classification, and Processing for Improved Automatic Speech Recognition in Multiparty Meetings

Kofi A. Boakye

Advisor: Nelson Morgan

January 17th, 2007

At-a-glance

Outline of talk

Introduction

Speech activity detection for nearfield microphones

Overlap speech detection for farfield microphones

Overlap speech processing for farfield microphones

Preliminary experiments

The meeting domain

Multiparty meetings are a rich content source for spoken language technology

Good automatic speech recognition (ASR) is important

Meeting ASR set-up

For a typical set-up, meeting ASR audio data is obtained from various sensors located in the room. Common types include:

Individual Headset Microphone

Lapel Microphone

Tabletop Microphone

Linear Microphone Array

Circular Microphone Array

ASR in multiparty meetings

Nearfield recognition is generally performed by decoding each audio channel separately

Farfield recognition is done in one of two ways:

1) Signal combination

2) Hypothesis combination

Performance metrics

Word error rate (WER)

Diarization error rate (DER)
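The slide names the metrics without defining them. As a reminder of the standard definition, WER is the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch of that computation; the function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not part of the talk. DER is the analogous time-based measure for "who spoke when" and is not sketched here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length.

    Assumes whitespace tokenization (an illustrative simplification).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.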

Crosstalk and overlapped speech

ASR in meetings presents specific challenges owing to the domain

Multiple individuals speaking at various times leads to two phenomena in particular: crosstalk and overlapped speech

Scope of project

Speech activity detection (SAD) for nearfield mics

Overlap detection for farfield mics

Overlap speech processing for farfield mics

Part I: Speech Activity Detection for Nearfield Microphones

Related work

Amount of work specific to multi-speaker SAD is rather small

Wrigley et al. ’03 and ’05

Pfau et al. ’01

Laskowski et al. ’04

Candidate features

Cepstral features

Candidate features

Cross-channel correlation
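The slide gives only the feature name. One common way to realize a cross-channel correlation feature is the peak of the normalized cross-correlation between a frame from the target channel and the same frame from another channel. The sketch below assumes that reading; the function name `max_normalized_xcorr` and the mean removal step are illustrative choices, not taken from the talk.

```python
import numpy as np


def max_normalized_xcorr(x, y):
    """Peak of the normalized cross-correlation between two frames.

    The value lies in [-1, 1]; it is near 1 when one frame is a delayed,
    scaled copy of the other.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    if denom == 0.0:
        return 0.0  # at least one frame is constant (e.g. digital silence)
    xcorr = np.correlate(x, y, mode="full")  # all relative lags
    return float(np.max(xcorr) / denom)
```

A high peak suggests the two channels are picking up the same source, which is the situation crosstalk creates on a nearfield microphone.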

Candidate features

Log-energy differences

Normalized log-energy difference
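The slide does not give formulas for these two features. The sketch below assumes one plausible reading: the log-energy difference is the frame-level log energy of the target channel minus that of each other channel, and the normalized variant first subtracts each channel's minimum frame log energy (a rough noise-floor estimate) to compensate for differing channel gains. The function names, the frame length of 400 samples, and the hop of 160 samples are all illustrative assumptions.

```python
import numpy as np


def frame_log_energy(signal, frame_len, hop):
    """Log energy of each frame of a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies = np.empty(n_frames)
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len]
        energies[t] = np.log(np.sum(frame ** 2) + 1e-10)  # floor avoids log(0)
    return energies


def log_energy_differences(channels, target, frame_len=400, hop=160,
                           normalize=False):
    """Target-minus-other log energies, shape (n_other_channels, n_frames).

    With normalize=True, each channel's minimum frame log energy is
    subtracted first (an assumed noise-floor normalization).
    """
    log_e = np.stack([frame_log_energy(c, frame_len, hop) for c in channels])
    if normalize:
        log_e = log_e - log_e.min(axis=1, keepdims=True)
    others = [j for j in range(len(channels)) if j != target]
    return log_e[target] - log_e[others]
```

Under this reading, a pure gain difference between two channels shows up as a constant offset in the plain feature and cancels in the normalized one.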

Candidate features

Time delay of arrival (TDOA) estimates
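The slide does not say how the TDOA estimates are obtained. A widely used estimator is generalized cross-correlation with phase transform (GCC-PHAT), sketched below as an assumption rather than as the method used in the talk. The function name `gcc_phat_delay` and its parameters are illustrative.

```python
import numpy as np


def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref`, in seconds.

    A positive result means `sig` lags `ref`. `max_tau` optionally limits
    the search to |delay| <= max_tau seconds.
    """
    sig = np.asarray(sig, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n = len(sig) + len(ref)  # zero-pad so circular correlation acts as linear

    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so the window covers lags -max_shift .. +max_shift.
    cc = np.roll(cc, max_shift)[: 2 * max_shift + 1]
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

For a pair of microphones, the estimated delay depends on where the talker sits, so it can help tell the wearer of a headset apart from a neighbor whose speech leaks into the same channel.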

Feature generation and combination

One issue with cross-channel features: variable number of channels

Proposed solution: use order statistics (max and min)

Considered feature combination as well
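The order-statistics idea can be sketched as follows: whatever the number of other channels in a meeting, the per-frame maximum and minimum of a cross-channel feature give a fixed two-dimensional vector. The function name `order_statistic_features` and the input layout are illustrative assumptions.

```python
import numpy as np


def order_statistic_features(pairwise):
    """Collapse a variable number of cross-channel features per frame.

    `pairwise` has shape (n_other_channels, n_frames): one row per
    non-target channel. The result has shape (n_frames, 2), holding the
    max and min over channels, regardless of n_other_channels.
    """
    pairwise = np.asarray(pairwise, dtype=float)
    if pairwise.ndim != 2 or pairwise.shape[0] == 0:
        raise ValueError("expected shape (n_other_channels, n_frames)")
    return np.stack([pairwise.max(axis=0), pairwise.min(axis=0)], axis=1)
```

A meeting with four participants and one with seven then yield feature vectors of the same length, so one model can be trained across meetings.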

Work plan for part I

Compare performance of HMM segmentation using proposed features
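The slide names HMM segmentation but gives no topology or model details. As a minimal sketch, assume a two-state HMM (nonspeech and speech) with per-frame log-likelihoods already computed from the features; Viterbi decoding then yields the segmentation. The function name `viterbi_segment` and all inputs are illustrative assumptions.

```python
import numpy as np


def viterbi_segment(log_lik, log_trans, log_prior):
    """Most likely state sequence for an HMM.

    log_lik:   (n_frames, n_states) per-frame log-likelihoods
    log_trans: (n_states, n_states) log transition probs, [from, to]
    log_prior: (n_states,) log initial-state probabilities
    """
    n_frames, n_states = log_lik.shape
    delta = np.empty((n_frames, n_states))       # best score ending in state
    back = np.zeros((n_frames, n_states), dtype=int)  # best predecessor

    delta[0] = log_prior + log_lik[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_trans  # scores[from, to]
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(n_states)] + log_lik[t]

    # Trace back from the best final state.
    path = np.empty(n_frames, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(n_frames - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

With high self-transition probabilities, the decoder smooths over isolated frames whose likelihoods weakly favor the other state, which a frame-by-frame decision would not do.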

Part II: Overlap Detection for Farfield Microphones

Related work

“Usable” speech for speaker recognition

Candidate features

Cepstral features

Candidate features

Cross-channel correlation

Candidate features

Pitch estimation features
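The slide does not specify a pitch estimator. A simple autocorrelation-based estimator is sketched below as an assumption; it returns both a pitch estimate and the height of the autocorrelation peak, which could serve as a periodicity measure. One might expect that peak to be weaker when two talkers overlap, though the talk does not state this. The function name `autocorr_pitch` and the 60 to 400 Hz search range are illustrative.

```python
import numpy as np


def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Return (f0_hz, peak_strength) from the frame's autocorrelation.

    peak_strength is the normalized autocorrelation at the chosen lag.
    Returns (0.0, 0.0) for a frame with no energy.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Keep non-negative lags only; index 0 is lag 0.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0.0:
        return 0.0, 0.0
    ac = ac / ac[0]

    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min : lag_max + 1]))
    return fs / lag, float(ac[lag])
```

The frame should span at least two periods of the lowest pitch of interest, which at 60 Hz means roughly 35 ms or more.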