This is a demonstration of environmental sound extraction using an onomatopoeic word.
We performed environmental sound extraction with one proposed method and two baseline methods:
Superclass-conditioned method (Baseline)
This method uses a superclass sound event class as the condition.
Subclass-conditioned method (Baseline)
This method uses a subclass sound event class as the condition.
Onomatopoeia-conditioned method (Proposed)
This method uses an onomatopoeic word as the condition.
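As a rough illustration of how the three kinds of condition differ, the sketch below encodes a class label as a one-hot vector and an onomatopoeic word as a sequence of phoneme IDs. This is only an assumption about a plausible input representation; the page does not describe how the actual models encode their conditions, and the helper names and the two-class vocabulary are hypothetical.

```python
def one_hot(label, vocabulary):
    """Encode a class label as a one-hot vector over a fixed vocabulary."""
    return [1.0 if label == item else 0.0 for item in vocabulary]


def parse_phonemes(text):
    """Split a phoneme string such as '/ t o q t u t u t u /' into tokens."""
    return text.strip().strip("/").split()


def phoneme_ids(phonemes, vocabulary):
    """Map each phoneme token to its index in the phoneme vocabulary."""
    return [vocabulary.index(p) for p in phonemes]


# Hypothetical two-class vocabulary, for illustration only.
superclasses = ["dice", "whistle"]
class_condition = one_hot("dice", superclasses)

# Phoneme string taken from the dice3 example on this page.
phonemes = parse_phonemes("/ t o q t u t u t u /")
phoneme_vocab = sorted(set(phonemes))
word_condition = phoneme_ids(phonemes, phoneme_vocab)

print(class_condition)
print(phonemes)
print(word_condition)
```

A class condition has a fixed length, whereas an onomatopoeic word yields a variable-length sequence that can describe the timbre and duration of one particular sound.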
The sound samples were taken from RWCP-SSD (Real World Computing Partnership Sound Scene Database) [1].
Some sound events in RWCP-SSD are labeled in the “event entry + ID” format, e.g., whistle1 and whistle2.
We created hierarchical sound-event classes by grouping labels with the same event entry, e.g., whistle.
We first selected 44 sound events from RWCP-SSD, which we call subclasses, and grouped them into 16 superclasses.
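The grouping of "event entry + ID" labels into superclasses can be sketched as follows. This is a minimal illustration that assumes the ID is simply a trailing number; the function names are hypothetical and the label list is illustrative, not the actual selection of 44 subclasses.

```python
import re
from collections import defaultdict


def event_entry(label):
    """Strip a trailing numeric ID, e.g. 'whistle2' -> 'whistle'."""
    return re.sub(r"\d+$", "", label)


def group_by_superclass(subclass_labels):
    """Map each event entry (superclass) to its sorted subclass labels."""
    groups = defaultdict(list)
    for label in subclass_labels:
        groups[event_entry(label)].append(label)
    return {name: sorted(members) for name, members in groups.items()}


# Illustrative labels only.
labels = ["whistle1", "whistle2", "dice3", "dice1"]
hierarchy = group_by_superclass(labels)
print(hierarchy)
```

Labels without a numeric suffix are left unchanged and form a superclass of their own.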
For the onomatopoeic words corresponding to each sound sample, we used the RWCP-SSD-Onomatopoeia dataset [2].
We constructed the following three evaluation datasets using the selected sound events:
Each mixture sound in this dataset is composed of a target sound and interference sounds whose superclass is different from that of the target sound.
Each mixture sound in this dataset is composed of a target sound and interference sounds whose superclass is the same as that of the target sound but whose subclass is different.
Each mixture sound in this dataset is composed of a target sound and interference sounds whose subclass is the same as that of the target sound but whose onomatopoeic word is different.
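The three interference rules above can be expressed as a simple filter over candidate samples. The sketch below is an illustration under assumptions: the `Sample` type, the dataset-type names (`different_superclass`, `same_superclass`, `same_subclass`), and the whistle phoneme strings are hypothetical and are not taken from the original work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    superclass: str
    subclass: str
    onomatopoeia: str  # phoneme string


def is_valid_interference(target, candidate, dataset_type):
    """Check whether a candidate may be mixed with the target in a given dataset."""
    if dataset_type == "different_superclass":
        return candidate.superclass != target.superclass
    if dataset_type == "same_superclass":
        return (candidate.superclass == target.superclass
                and candidate.subclass != target.subclass)
    if dataset_type == "same_subclass":
        return (candidate.subclass == target.subclass
                and candidate.onomatopoeia != target.onomatopoeia)
    raise ValueError(f"unknown dataset type: {dataset_type}")


def interference_candidates(target, pool, dataset_type):
    """Return the samples in the pool that satisfy the dataset's rule."""
    return [s for s in pool if is_valid_interference(target, s, dataset_type)]


# Illustrative metadata; the whistle phoneme strings are made up.
target = Sample("whistle", "whistle1", "p i i")
pool = [
    Sample("dice", "dice3", "t o q t u t u t u"),
    Sample("whistle", "whistle2", "p i p i"),
    Sample("whistle", "whistle1", "p i q"),
    Sample("whistle", "whistle1", "p i i"),
]

for name in ("different_superclass", "same_superclass", "same_subclass"):
    print(name, interference_candidates(target, pool, name))
```

Each rule is stricter than the previous one, so the third dataset contains the mixtures in which a class label alone cannot distinguish the target from the interference.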
Onomatopoeic word: / t i ch i ch i ch i ch i ch i /
SDRi = 1.17 dB (Superclass-conditioned method)
SDRi = 1.32 dB (Subclass-conditioned method)
SDRi = 4.89 dB (Onomatopoeia-conditioned method)
Mixture sound
Ground truth
Superclass-conditioned method (Baseline)
Subclass-conditioned method (Baseline)
Onomatopoeia-conditioned method (Proposed)
Superclass: dice
Subclass: dice3
Onomatopoeic word: / t o q t u t u t u /
SDRi = 1.63 dB (Superclass-conditioned method)
SDRi = 2.44 dB (Subclass-conditioned method)
SDRi = 5.85 dB (Onomatopoeia-conditioned method)
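For readers unfamiliar with the metric, SDRi (signal-to-distortion ratio improvement) is commonly computed as the SDR of the extracted sound minus the SDR of the unprocessed mixture, both measured against the ground truth. The sketch below assumes this common definition with a plain energy-ratio SDR; the exact SDR variant used to obtain the values on this page is not stated here, and the signals and function names are toy examples.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB: 10 * log10(|s|^2 / |s - s_hat|^2)."""
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps avoids division by zero when the estimate is perfect.
    return 10.0 * math.log10(signal_power / (error_power + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: SDR of the extracted sound minus SDR of the input mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals: the extraction halves the amplitude of the interference.
reference = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [r + n for r, n in zip(reference, interference)]
estimate = [r + 0.5 * n for r, n in zip(reference, interference)]

sdri = sdr_improvement_db(reference, mixture, estimate)
print(f"SDRi = {sdri:.2f} dB")
```

Halving the residual interference amplitude quarters its power, so the improvement in this toy case is 10 log10(4), about 6 dB.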
[1] S. Nakamura, K. Hiyane, F. Asano, and T. Endo, “Sound scene data collection in real acoustic environments,” The Journal of the Acoustical Society of Japan (E), vol. 20, no. 3, pp. 225–231, 1999.
[2] Y. Okamoto, K. Imoto, S. Takamichi, R. Yamanishi, T. Fukumori, and Y. Yamashita, “RWCP-SSD-Onomatopoeia: Onomatopoeic word dataset for environmental sound synthesis,” Proc. Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 125–129, 2020.