Please use this identifier to cite or link to this item:
https://elibrary.khec.edu.np/handle/123456789/1019
Title: | Nepali Text To Speech Synthesis |
Authors: | Dipesh Shrestha (770310); Janak Raj Joshi (770313); Nabraj Joshi (770320); Piyush Gwayamaru (770325); Rashik Prajapati (770330) |
Advisor: | Sarala Shakya |
Keywords: | Nepali TTS, Text-to-Speech synthesis, Tacotron model architecture, Mel-Spectrogram, Griffin-Lim Algorithm, OpenSLR, Attention alignment |
Issue Date: | 2025 |
College Name: | Khwopa Engineering College |
Level: | Bachelor's Degree |
Degree: | BE Computer |
Department Name: | Department of Computer Engineering |
Abstract: | The Nepali Text-to-Speech Synthesis System is a neural Text-to-Speech (TTS) system that converts Nepali text into speech using the Tacotron model architecture. The dataset was compiled from three sources: 2,064 audio-text pairs from OpenSLR43; 57,000 samples from OpenSLR54, filtered down to 25,218 and further refined to 10,737 samples with an average duration of 10 seconds; and 1,300 Nepali sentences recorded manually using Audacity. Preprocessing included noise removal, silence trimming, correction of extra spaces and special characters, and accurate alignment of text with the corresponding audio. The final dataset was formatted as Tab-Separated Values (TSV) and split into training and testing sets in an 80:20 ratio. The Tacotron-based model was trained to generate mel-spectrograms from input text, which were then converted to waveform audio using the Griffin-Lim algorithm. Training used an initial learning rate of 0.002, along with techniques such as attention alignment and regularization, to improve pronunciation accuracy and the naturalness of the speech. Model performance was monitored using L1 loss during training, and attention alignment plots were used to visualize the mapping between encoder and decoder timesteps. Post-training evaluation compared predicted mel-spectrograms with ground-truth spectrograms. The system achieved an average training loss of 0.09485 and an average evaluation loss of 0.10690. |
URI: | https://elibrary.khec.edu.np/handle/123456789/1019 |
Appears in Collections: | PU Computer Report |
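As a rough illustration of two steps named in the abstract, the 80:20 train/test split of a TSV dataset and the L1 loss between predicted and ground-truth mel-spectrograms, a minimal Python sketch follows. It is not taken from the report: the two-column layout (audio file name, transcript), the shuffling with a fixed seed, and the helper names `split_tsv_rows` and `l1_loss` are all assumptions made for the example, and the spectrogram values are toy numbers.

```python
import csv
import io
import random


def split_tsv_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper: the report states only the 80:20 ratio,
    not whether or how the data was shuffled.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def l1_loss(predicted, target):
    """Mean absolute error between two equally shaped spectrograms.

    Each spectrogram is a list of frames; each frame is a list of floats.
    """
    total = 0.0
    count = 0
    for pred_frame, true_frame in zip(predicted, target):
        for p, t in zip(pred_frame, true_frame):
            total += abs(p - t)
            count += 1
    return total / count


# Assumed TSV layout: audio file name <TAB> transcript (placeholder text).
rows = [[f"utt_{i:03d}.wav", f"transcript {i}"] for i in range(10)]
tsv_text = "\n".join("\t".join(row) for row in rows)

parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
train_rows, test_rows = split_tsv_rows(parsed)
print(len(train_rows), len(test_rows))

# Toy 2-frame, 2-bin "mel-spectrograms" (made-up values).
predicted = [[0.1, 0.2], [0.3, 0.4]]
target = [[0.0, 0.2], [0.5, 0.4]]
print(l1_loss(predicted, target))
```

In a real pipeline the frames would come from the model and from the recorded audio, and the loss would typically be computed with an array library rather than nested Python loops; plain lists are used here only to keep the sketch dependency-free.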
Files in This Item:
File | Size | Format | |
---|---|---|---|
Nepali Text To Speech Synthesis.pdf (Restricted Access) | 1.66 MB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.