Environmental sound extraction using onomatopoeic words


Authors: Yuki Okamoto¹, Shota Horiguchi², Masaaki Yamamoto², Keisuke Imoto³, Yohei Kawaguchi²
Affiliations: ¹Ritsumeikan University, ²Hitachi, Ltd., ³Doshisha University

This page demonstrates environmental sound extraction using an onomatopoeic word. We performed environmental sound extraction with two baseline methods and one proposed method:

Superclass-conditioned method (Baseline)

Subclass-conditioned method (Baseline)

Onomatopoeia-conditioned method (Proposed)

The sound samples were taken from RWCP-SSD (Real World Computing Partnership Sound Scene Database) [1]. Some sound events in RWCP-SSD are labeled in the “event entry + ID” format, e.g., whistle1 and whistle2. We created hierarchical sound-event classes by grouping labels that share the same event entry, e.g., whistle. Specifically, we selected 44 sound events from RWCP-SSD, which we call subclasses, and grouped them into 16 superclasses. The onomatopoeic words corresponding to each sound sample were taken from RWCP-SSD-Onomatopoeia [2].
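
The grouping of “event entry + ID” labels into a hierarchy can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: it assumes that the ID is a trailing run of digits, the names event_entry and group_by_superclass are hypothetical, and the label list contains only labels that appear on this page rather than the full set of 44 subclasses. The real 16 superclasses may not follow from string matching alone.

```python
import re
from collections import defaultdict

# Assumption: the ID part of a label is a trailing run of digits.
_ID_SUFFIX = re.compile(r"\d+$")


def event_entry(label: str) -> str:
    """Return the event entry of a label, e.g. 'whistle2' -> 'whistle'.

    Labels without a trailing ID are returned unchanged.
    """
    return _ID_SUFFIX.sub("", label)


def group_by_superclass(subclasses):
    """Map each event entry (superclass) to the sorted list of its subclasses."""
    groups = defaultdict(list)
    for label in subclasses:
        groups[event_entry(label)].append(label)
    return {entry: sorted(members) for entry, members in groups.items()}


# Illustrative labels only; this is not the full list of 44 subclasses.
labels = ["whistle1", "whistle2", "phone1", "phone3", "phone4",
          "metal05", "dice1", "dice3"]
hierarchy = group_by_superclass(labels)

for superclass, members in hierarchy.items():
    print(superclass, members)
```

With these labels, whistle1 and whistle2 fall under the superclass whistle, and metal05 falls under metal.
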
We constructed the following three evaluation datasets from the selected sound events:

(I) Inter-superclass dataset (samples of extracted sounds)

(II) Intra-superclass dataset (samples of extracted sounds)

(III) Intra-subclass dataset (samples of extracted sounds)
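
Each example on this page reports SDRi, the signal-to-distortion ratio improvement: the SDR of the extracted sound minus the SDR of the unprocessed mixture, both measured against the ground truth. The sketch below shows how such a value could be computed. It assumes the plain energy-ratio definition of SDR; this page does not state which SDR variant was used. The function names and the synthetic signals are illustrative and do not come from RWCP-SSD.

```python
import math
import random


def sdr(reference, estimate, eps=1e-12):
    """SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: plain energy-ratio SDR, not necessarily the variant
    used to produce the numbers on this page.
    """
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


def sdr_improvement(reference, estimate, mixture):
    """SDRi in dB: SDR of the estimate minus SDR of the input mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)


# Synthetic toy signals (not RWCP-SSD data): a sine target plus noise.
random.seed(0)
n = 16000
target = [math.sin(2.0 * math.pi * 440.0 * t / n) for t in range(n)]
interference = [0.5 * random.gauss(0.0, 1.0) for _ in range(n)]
mixture = [s + v for s, v in zip(target, interference)]

# Pretend extractor that attenuates the interference amplitude by a factor of 10.
estimate = [s + 0.1 * v for s, v in zip(target, interference)]

sdri = sdr_improvement(target, estimate, mixture)
print(f"SDRi = {sdri:.2f} dB")  # SDRi = 20.00 dB
```

A positive SDRi means the extracted sound is closer to the ground truth than the mixture was. A negative value means the extraction made the signal worse than leaving the mixture untouched.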

(I) Examples of environmental sound extraction using inter-superclass dataset

Mixture sound	Ground truth

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: whistle	Subclass: whistle2	Onomatopoeic word: / b i: /
SDRi = 2.87 dB	SDRi = 2.40 dB	SDRi = 2.31 dB

Mixture sound	Ground truth

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: phone	Subclass: phone1	Onomatopoeic word: / p u r u r u r u /
SDRi = 1.62 dB	SDRi = 3.45 dB	SDRi = 4.51 dB

Mixture sound	Ground truth

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: phone	Subclass: phone3	Onomatopoeic word: / p i p i p i p i p i p i /
SDRi = 4.38 dB	SDRi = 2.56 dB	SDRi = 3.35 dB

(II) Examples of environmental sound extraction using intra-superclass dataset

Mixture sound	Ground truth

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: dice	Subclass: dice1	Onomatopoeic word: / p a N t a r a r a /
SDRi = 2.77 dB	SDRi = 4.71 dB	SDRi = 2.59 dB

Mixture sound	Ground truth

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: cup	Subclass: cup1	Onomatopoeic word: / a t i N /
SDRi = 2.30 dB	SDRi = 5.94 dB	SDRi = 5.24 dB

Mixture sound	Ground truth

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: bells	Subclass: bells1	Onomatopoeic word: / ch i r i r i r i r i r i r i /
SDRi = -1.97 dB	SDRi = 9.27 dB	SDRi = 2.66 dB

(III) Examples of environmental sound extraction using intra-subclass dataset

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: metal	Subclass: metal05	Onomatopoeic word: / p o q /
SDRi = 0.28 dB	SDRi = 7.98 dB	SDRi = 9.26 dB

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: bell	Subclass: bell1	Onomatopoeic word: / p i r i r i r i r i N /
SDRi = 0.35 dB	SDRi = 1.35 dB	SDRi = 5.30 dB

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: phone	Subclass: phone4	Onomatopoeic word: / p u r u r u r u r u r u /
SDRi = 0.14 dB	SDRi = 0.33 dB	SDRi = 2.63 dB

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: claps	Subclass: claps2	Onomatopoeic word: / t i ch i ch i ch i ch i ch i /
SDRi = 1.17 dB	SDRi = 1.32 dB	SDRi = 4.89 dB

Superclass-conditioned method (Baseline)	Subclass-conditioned method (Baseline)	Onomatopoeia-conditioned method (Proposed)

Superclass: dice	Subclass: dice3	Onomatopoeic word: / t o q t u t u t u /
SDRi = 1.63 dB	SDRi = 2.44 dB	SDRi = 5.85 dB

[1] S. Nakamura, K. Hiyane, F. Asano, and T. Endo, “Sound scene data collection in real acoustic environments,” The Journal of the Acoustical Society of Japan (E), vol. 20, no. 3, pp. 225–231, 1999.
[2] Y. Okamoto, K. Imoto, S. Takamichi, R. Yamanishi, T. Fukumori, and Y. Yamashita, “RWCP-SSD-Onomatopoeia: Onomatopoeic word dataset for environmental sound synthesis,” Proc. Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 125–129, 2020.
