Title: The Mandarin Monosyllable Recognition by Using the Method of Convolutional Neural Network (利用CNN類神經法於中文單音之辨識)
Author: Chen-Hung Tsou (鄒振宏)
Keywords: multilayer perceptron (MLP); Mel-frequency cepstral coefficients (MFCC); convolutional neural network (CNN); speech recognition; machine learning
This thesis investigates the use of convolutional neural networks (CNNs) for recognizing Mandarin monosyllables. A total of 1,391 monosyllables recorded by 20 different speakers are pre-processed through digital sampling, frame segmentation, and windowing, and Mel-frequency cepstral coefficients (MFCCs) are then extracted as the input features of the model. The method applies convolution, pooling, and batch normalization layers to extract further features from the MFCCs, and the result is fed to a multilayer perceptron (MLP) for classification. In addition to classifying all 1,391 monosyllables directly, other classification schemes were tried, such as classifying the vowel first and then the consonant, or additionally classifying the tone of the vowel. The three main models reach recognition rates of 82.89%, 82.76%, and 80.46%, respectively. Finally, unweighted voting across the models yields the best recognition rate of 84.05%.
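
The frame segmentation and windowing steps of the pre-processing can be illustrated with a minimal sketch. The abstract does not state the sampling rate, frame length, hop size, or window type, so the 16 kHz rate, 25 ms frames, 10 ms hop, and Hamming window below are assumptions chosen only for illustration, and the function names are hypothetical. MFCC extraction would follow on the windowed frames.

```python
import math

def frame_signal(samples, frame_len, hop_len):
    """Split a sample sequence into overlapping frames; an incomplete tail is dropped."""
    if len(samples) < frame_len:
        return []
    n_frames = 1 + (len(samples) - frame_len) // hop_len
    return [samples[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)]

def hamming(frame_len):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if frame_len == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def window_frames(frames, window):
    """Multiply every frame element-wise by the window."""
    return [[s * w for s, w in zip(frame, window)] for frame in frames]

# Assumed settings (not taken from the thesis).
sample_rate = 16000
frame_len = 400   # 25 ms at 16 kHz
hop_len = 160     # 10 ms at 16 kHz

# One second of a synthetic 440 Hz tone stands in for a recorded monosyllable.
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]

frames = frame_signal(signal, frame_len, hop_len)
windowed = window_frames(frames, hamming(frame_len))
```

Overlapping frames keep the short-time spectrum roughly stationary within each frame, and the tapered window reduces spectral leakage at the frame edges before the spectrum is computed.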

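The unweighted voting step can be sketched as a simple majority vote over the labels predicted by the three models. The abstract does not say how ties are resolved, so breaking ties in favor of the earliest-listed model is an assumption here, and the syllable labels and function names are hypothetical.

```python
from collections import Counter

def unweighted_vote(predictions):
    """Return the label chosen by the most models.

    Every model counts equally. On a tie, the label from the
    earliest-listed model wins (an assumed tie-break rule).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def vote_batch(per_model_predictions):
    """Combine per-model prediction lists into one list of voted labels."""
    return [unweighted_vote(list(votes)) for votes in zip(*per_model_predictions)]

# Hypothetical predictions from three models for three utterances.
model_a = ["ma1", "ba4", "shi4"]
model_b = ["ma1", "pa4", "si4"]
model_c = ["na1", "ba4", "shi2"]

combined = vote_batch([model_a, model_b, model_c])
```

In this example the first two utterances are decided by a two-to-one majority, while the third has no majority and falls back to the first model's label.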