Environmental sound extraction using onomatopoeic words
  

Authors: Yuki Okamoto (1), Shota Horiguchi (2), Masaaki Yamamoto (2), Keisuke Imoto (3), Yohei Kawaguchi (2)
Affiliations: (1) Ritsumeikan University, (2) Hitachi, Ltd., (3) Doshisha University


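This is a demonstration of environmental sound extraction using an onomatopoeic word. We performed environmental sound extraction with one proposed method, the onomatopoeia-conditioned method, and two baseline methods for comparison, the superclass-conditioned method and the subclass-conditioned method.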

For the sound data, we used RWCP-SSD (Real World Computing Partnership-Sound Scene Database) [1]. Some sound events in RWCP-SSD are labeled in the “event entry + ID” format, e.g., whistle1 and whistle2. We created hierarchical sound-event classes by grouping labels that share the same event entry, e.g., whistle. We first selected 44 sound events from RWCP-SSD, which we call subclasses, and grouped them into 16 superclasses. For the onomatopoeic words corresponding to each sound sample, we used RWCP-SSD-Onomatopoeia [2].
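The grouping of subclasses into superclasses can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the ID is the run of trailing digits in each label, and the helper names `event_entry` and `group_by_superclass` are hypothetical. The labels are taken from the examples on this page.

```python
import re
from collections import defaultdict


def event_entry(label: str) -> str:
    """Return the event entry by stripping the trailing numeric ID.

    Assumption: labels follow "event entry + ID" with a purely numeric ID,
    e.g. "whistle2" -> "whistle", "metal05" -> "metal".
    """
    return re.sub(r"\d+$", "", label)


def group_by_superclass(labels):
    """Map each event entry (superclass) to its labels (subclasses)."""
    groups = defaultdict(list)
    for label in labels:
        groups[event_entry(label)].append(label)
    return dict(groups)


# A few subclass labels that appear in the examples on this page.
subclasses = ["whistle1", "whistle2", "phone1", "phone3", "metal05"]
superclasses = group_by_superclass(subclasses)
```

Labels without a trailing number are left unchanged and form a superclass with a single subclass.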
We constructed the following three evaluation datasets from the selected sound events: the inter-superclass dataset, the intra-superclass dataset, and the intra-subclass dataset.






(I) Examples of environmental sound extraction using the inter-superclass dataset

Audio for each example: mixture sound, ground truth, and the sound extracted by each method.

Example 1
  Superclass-conditioned method (Baseline)    SDRi = 2.87 dB   Superclass: whistle
  Subclass-conditioned method (Baseline)      SDRi = 2.40 dB   Subclass: whistle2
  Onomatopoeia-conditioned method (Proposed)  SDRi = 2.31 dB   Onomatopoeic word: / b i: /

Example 2
  Superclass-conditioned method (Baseline)    SDRi = 1.62 dB   Superclass: phone
  Subclass-conditioned method (Baseline)      SDRi = 3.45 dB   Subclass: phone1
  Onomatopoeia-conditioned method (Proposed)  SDRi = 4.51 dB   Onomatopoeic word: / p u r u r u r u /

Example 3
  Superclass-conditioned method (Baseline)    SDRi = 4.38 dB   Superclass: phone
  Subclass-conditioned method (Baseline)      SDRi = 2.56 dB   Subclass: phone3
  Onomatopoeia-conditioned method (Proposed)  SDRi = 3.35 dB   Onomatopoeic word: / p i p i p i p i p i p i /





(II) Examples of environmental sound extraction using the intra-superclass dataset

Audio for each example: mixture sound, ground truth, and the sound extracted by each method.

Example 1
  Superclass-conditioned method (Baseline)    SDRi = 2.77 dB   Superclass: dice
  Subclass-conditioned method (Baseline)      SDRi = 4.71 dB   Subclass: dice1
  Onomatopoeia-conditioned method (Proposed)  SDRi = 2.59 dB   Onomatopoeic word: / p a N t a r a r a /

Example 2
  Superclass-conditioned method (Baseline)    SDRi = 2.30 dB   Superclass: cup
  Subclass-conditioned method (Baseline)      SDRi = 5.94 dB   Subclass: cup1
  Onomatopoeia-conditioned method (Proposed)  SDRi = 5.24 dB   Onomatopoeic word: / a t i N /

Example 3
  Superclass-conditioned method (Baseline)    SDRi = -1.97 dB  Superclass: bells
  Subclass-conditioned method (Baseline)      SDRi = 9.27 dB   Subclass: bells1
  Onomatopoeia-conditioned method (Proposed)  SDRi = 2.66 dB   Onomatopoeic word: / ch i r i r i r i r i r i r i /





(III) Examples of environmental sound extraction using the intra-subclass dataset

Audio for each example: mixture sound, ground truth, and the sound extracted by each method.

Example 1
  Superclass-conditioned method (Baseline)    SDRi = 0.28 dB   Superclass: metal
  Subclass-conditioned method (Baseline)      SDRi = 7.98 dB   Subclass: metal05
  Onomatopoeia-conditioned method (Proposed)  SDRi = 9.26 dB   Onomatopoeic word: / p o q /

Example 2
  Superclass-conditioned method (Baseline)    SDRi = 0.35 dB   Superclass: bell
  Subclass-conditioned method (Baseline)      SDRi = 1.35 dB   Subclass: bell1
  Onomatopoeia-conditioned method (Proposed)  SDRi = 5.30 dB   Onomatopoeic word: / p i r i r i r i r i N /

Example 3
  Superclass-conditioned method (Baseline)    SDRi = 0.14 dB   Superclass: phone
  Subclass-conditioned method (Baseline)      SDRi = 0.33 dB   Subclass: phone4
  Onomatopoeia-conditioned method (Proposed)  SDRi = 2.63 dB   Onomatopoeic word: / p u r u r u r u r u r u /

Example 4
  Superclass-conditioned method (Baseline)    SDRi = 1.17 dB   Superclass: claps
  Subclass-conditioned method (Baseline)      SDRi = 1.32 dB   Subclass: claps2
  Onomatopoeia-conditioned method (Proposed)  SDRi = 4.89 dB   Onomatopoeic word: / t i ch i ch i ch i ch i ch i /

Example 5
  Superclass-conditioned method (Baseline)    SDRi = 1.63 dB   Superclass: dice
  Subclass-conditioned method (Baseline)      SDRi = 2.44 dB   Subclass: dice3
  Onomatopoeia-conditioned method (Proposed)  SDRi = 5.85 dB   Onomatopoeic word: / t o q t u t u t u /
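
The SDRi values in the examples are signal-to-distortion ratio improvements. The page does not spell out the formula, so the following sketch assumes the commonly used definition: the SDR of the extracted sound minus the SDR of the unprocessed mixture, both measured against the ground truth. The function names and the toy signals are illustrative only.

```python
import math


def sdr(reference, estimate):
    """SDR in dB: 10 * log10(||ref||^2 / ||ref - est||^2).

    Assumes the estimate is not identical to the reference,
    otherwise the error energy is zero and the ratio is undefined.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement(reference, mixture, estimate):
    """SDRi in dB: gain of the extracted sound over the raw mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)


# Toy signals (not from RWCP-SSD): the mixture adds a constant 0.5
# interference to the target; the estimate leaves a 0.1 residual.
target = [1.0, -1.0, 1.0, -1.0]
mixture = [t + 0.5 for t in target]
estimate = [t + 0.1 for t in target]

sdri = sdr_improvement(target, mixture, estimate)
```

Under this definition, a positive SDRi means the extracted sound is closer to the ground truth than the mixture was, and a negative value, as in the bells1 example, means the output is further from it.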



[1] S. Nakamura, K. Hiyane, F. Asano, and T. Endo, “Sound scene data collection in real acoustic environments,” The Journal of the Acoustical Society of Japan (E), vol. 20, no. 3, pp. 225–231, 1999.
[2] Y. Okamoto, K. Imoto, S. Takamichi, R. Yamanishi, T. Fukumori, and Y. Yamashita, “RWCP-SSD-Onomatopoeia: Onomatopoeic word dataset for environmental sound synthesis,” Proc. Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 125–129, 2020.