This is a demonstration of environmental sound synthesis from onomatopoeic words [1]. We propose two methods for synthesizing environmental sounds from onomatopoeic words, both based on the sequence-to-sequence conversion framework [2]:
Environmental sound synthesis using only onomatopoeic words (seq2seq)
Environmental sound synthesis using onomatopoeic words and sound event labels (seq2seq + event label)
In addition to onomatopoeic words, this method uses sound event labels to condition the decoder and thereby control the output acoustic features.
For the dataset, we used 10 sound events (manual coffee grinder, cup clinking, alarm clock ringing, whistle, maracas, drum, electric shaver, trash box banging, tearing paper, bell ringing) contained in the RWCP-SSD (Real World Computing Partnership-Sound Scene Database) [3]. For the onomatopoeic words corresponding to each sound sample, we used the RWCP-SSD-Onomatopoeia dataset [4].
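The two pipelines above can be sketched in code. The following is a minimal NumPy illustration, not the paper's actual architecture: the real system uses trained neural networks, whereas here all weights are random, the phoneme and event vocabularies are toy stand-ins, and the uniform "attention" is a simplification. The sketch only shows the structural difference between the two methods: in seq2seq the decoder sees only the encoded phoneme sequence, while in seq2seq + event label an event-label embedding is additionally fed to the decoder at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies (illustrative only; the real model is trained on RWCP-SSD).
PHONEMES = ["ch", "i:", "q", "b", "o", "N", "r", "i"]
EVENTS = ["whistle", "cup", "shaver", "trash_box", "bell"]

EMB, HID, N_MEL = 8, 16, 20  # embedding size, hidden size, spectral bins

# Randomly initialised parameters stand in for trained weights.
W_emb = rng.normal(size=(len(PHONEMES), EMB))
W_evt = rng.normal(size=(len(EVENTS), EMB))
W_enc = rng.normal(size=(EMB + HID, HID)) * 0.1
W_dec = rng.normal(size=(HID + EMB + HID, HID)) * 0.1  # hidden + event emb + context
W_out = rng.normal(size=(HID, N_MEL)) * 0.1

def encode(phonemes):
    """Simple RNN encoder over a phoneme sequence."""
    h = np.zeros(HID)
    states = []
    for p in phonemes:
        x = W_emb[PHONEMES.index(p)]
        h = np.tanh(np.concatenate([x, h]) @ W_enc)
        states.append(h)
    return np.stack(states)

def decode(enc_states, n_frames, event=None):
    """Autoregressive decoder; optionally conditioned on a sound event label."""
    evt = W_evt[EVENTS.index(event)] if event else np.zeros(EMB)
    h = enc_states[-1]
    frames = []
    for _ in range(n_frames):
        # Uniform "attention" over encoder states keeps the sketch simple.
        context = enc_states.mean(axis=0)
        h = np.tanh(np.concatenate([h, evt, context]) @ W_dec)
        frames.append(h @ W_out)
    return np.stack(frames)

# seq2seq: onomatopoeic word only.
mel_a = decode(encode(["ch", "i:", "q"]), n_frames=30)
# seq2seq + event label: same word, decoder steered toward "whistle".
mel_b = decode(encode(["ch", "i:", "q"]), n_frames=30, event="whistle")
print(mel_a.shape, mel_b.shape)  # (30, 20) (30, 20)
```

Because the event embedding enters the decoder at every step, the same phoneme sequence can yield different acoustic features under different event labels, which is exactly the control the second method adds.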
Natural sound
KanaWave
Seq2seq (Proposed)
Seq2seq + event label (Proposed)
Phoneme sequence: / ch i: q /
Whistle
Cup
Shaver
Whistle
Tearing paper
Phoneme sequence: / b o N q /
Trash box
Drum
Trash box
Phoneme sequence: / r i N r i N /
Bell
Bell
Clock
Phoneme sequence: / b i i i i /
Shaver
Tearing paper
Whistle
Shaver
Phoneme sequence: / sh a r i sh a r i /
Maracas
Maracas
Manual coffee grinder
Comparison of synthesized sounds with different input onomatopoeic words
Seq2seq + event label (Proposed)
Sound event label: Cup
Phoneme sequence: / k a ch i N /
Phoneme sequence: / k a ch i q /
Phoneme sequence: / p i N q /
Sound event label: Shaver
Phoneme sequence: / b a: u a /
Phoneme sequence: / b e: /
Phoneme sequence: / j i i: j i i: i /
[1] Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryosuke Yamanishi, Takahiro Fukumori, and Yoichi Yamashita, "Onoma-to-wave: Environmental Sound Synthesis from Onomatopoeic Words," APSIPA Transactions on Signal and Information Processing, Vol. 11, No. 1, e13, 2022.
[2] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to Sequence Learning with Neural Networks," arXiv preprint, arXiv:1409.3215, 2014.
[3] S. Nakamura, K. Hiyane, F. Asano, and T. Endo, "Sound scene data collection in real acoustic environments," Journal of the Acoustical Society of Japan (E), Vol. 20, No. 3, pp. 225–231, 1999.
[4] Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryosuke Yamanishi, Takahiro Fukumori, and Yoichi Yamashita, "RWCP-SSD-Onomatopoeia: Onomatopoeic Word Dataset for Environmental Sound Synthesis," Proc. Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 125–129, 2020.