Modern text-to-speech (TTS) technology has matured to the point where recent work focuses on optimizing existing systems and on building TTS models for resource-scarce languages, rather than on finding new model architectures. This paper proposes a novel approach to modeling modern end-to-end (E2E) TTS for the Indonesian language, integrating three different generative adversarial network (GAN)-based vocoders for comparison. In the evaluation, the proposed system shows promising results, reaching a mean opinion score (MOS) of 4.60 while maintaining fast inference, as indicated by a real-time factor (RTF) below one.
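For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how they are commonly computed. It assumes the usual definitions: RTF as synthesis time divided by the duration of the generated audio, and MOS as the arithmetic mean of listener ratings on a 1 to 5 scale. The function names and all numbers are hypothetical illustrations, not the evaluation code or data of this paper.

```python
from statistics import mean


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of the audio produced.

    A value below 1.0 means the system generates speech faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def mean_opinion_score(ratings: list[float]) -> float:
    """Return the arithmetic mean of listener ratings (assumed 1-5 scale)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are assumed to lie on a 1-5 scale")
    return mean(ratings)


# Illustrative values only (not taken from the paper):
# 0.5 s of compute for 2.0 s of audio gives an RTF of 0.25.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=2.0)
mos = mean_opinion_score([5, 4, 5, 4])
print(f"RTF = {rtf:.2f}, MOS = {mos:.2f}")
```

Under these definitions, an RTF below one means an utterance is synthesized in less time than it takes to play back.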
Copyright © 2022