Nepali Text To Speech Synthesis

Dipesh Shrestha (770310), Janak Raj Joshi (770313), Nabraj Joshi (770320), Piyush Gwayamaru (770325), Rashik Prajapati (770330)

Please use this identifier to cite or link to this item: https://elibrary.khec.edu.np/handle/123456789/1019

Title:	Nepali Text To Speech Synthesis
Authors:	Dipesh Shrestha (770310), Janak Raj Joshi (770313), Nabraj Joshi (770320), Piyush Gwayamaru (770325), Rashik Prajapati (770330)
Advisor:	Sarala Shakya
Keywords:	Nepali TTS, Text-to-Speech synthesis, Tacotron model architecture, Mel-Spectrogram, Griffin-Lim Algorithm, OpenSLR, Attention alignment
Issue Date:	2025
College Name:	Khwopa Engineering College
Level:	Bachelor's Degree
Degree:	BE Computer
Department Name:	Department of Computer Engineering
Abstract:	The Nepali Text-to-Speech Synthesis System is a neural text-to-speech (TTS) system that converts Nepali text into speech using the Tacotron model architecture. The dataset was compiled from three sources: 2,064 audio-text pairs from OpenSLR43, 57,000 samples from OpenSLR54 (filtered down to 25,218 and further refined into 10,737 samples with an average duration of 10 seconds), and 1,300 Nepali sentences recorded manually with Audacity. Preprocessing included noise removal, silence trimming, correction of extra spaces and special characters, and accurate alignment of text with the corresponding audio. The final dataset was formatted as Tab-Separated Values (TSV) and split into training and testing sets in an 80:20 ratio. The Tacotron-based model was trained to generate mel-spectrograms from input text, which were then converted to waveform audio using the Griffin-Lim algorithm. Training employed an initial learning rate of 0.002 along with techniques such as attention alignment and regularization to improve pronunciation accuracy and the naturalness of the speech. Model performance was monitored using L1 loss during training, while attention alignment plots were used to visualize the mapping between encoder and decoder timesteps. Post-training evaluation involved comparing predicted mel-spectrograms with ground-truth spectrograms. The system achieved an average training loss of 0.09485 and an average evaluation loss of 0.10690.
URI:	https://elibrary.khec.edu.np/handle/123456789/1019
Appears in Collections:	PU Computer Report
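
Illustration (not taken from the report, whose full text is restricted): the abstract mentions an 80:20 train/test split of a TSV dataset and an L1 loss between predicted and ground-truth mel-spectrograms. The sketch below shows one way these two steps could look in plain Python. The two-column layout (audio file name, transcript), the shuffling with a fixed seed, the sample rows, and the function names `split_tsv_rows` and `l1_loss` are all assumptions made for this example.

```python
import csv
import io
import random


def split_tsv_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def l1_loss(predicted, target):
    """Mean absolute error between two equally shaped mel-spectrograms,
    each given as a list of frames (a frame is a list of floats)."""
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(predicted, target):
        for p, t in zip(pred_frame, true_frame):
            total += abs(p - t)
            count += 1
    return total / count


# Hypothetical TSV content: audio file name, then transcript.
sample_tsv = "\n".join(f"clip_{i:03d}.wav\tवाक्य {i}" for i in range(10))
rows = list(csv.reader(io.StringIO(sample_tsv), delimiter="\t"))
train_rows, test_rows = split_tsv_rows(rows)
```

With ten rows, the split yields eight training rows and two test rows. A real pipeline would more likely compute the loss on tensors inside the training framework; the list-based version here only makes the arithmetic explicit.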

Files in This Item:

File	Description	Size	Format
Nepali Text To Speech Synthesis.pdf (Restricted Access)		1.66 MB	Adobe PDF
