Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/19866
Title: 在大量中文語料中語言模型關於平滑問題特性之分析
Analyzing Properties of Smoothing Issues for Language Models in Large Mandarin Corpus
Author: 黃健祐 (Hwang, Chien-Yo)
Keywords: Language models
smoothing methods
perplexity
cross entropy
Publisher: 資訊網路多媒體研究所
References:
[1] 呂宜玲,中文語音辨識中語言模型的強化之研究,國立交通大學資訊工程系所,碩士學位論文,2005。
[2] 王韋華、徐波,漢語語言模型的規型對統計機器翻譯系統的影響,微計算機信息 (Microcomputer Information),2010年第26卷第9-3期。
[3] 李民祥、吳世弘、曾議慶、楊秉哲、谷圳,基於對照表以及語言模型之簡繁字體轉換,Computational Linguistics and Chinese Language Processing, Vol. 15, No. 1, March 2010, pp. 19-36。
[4] 顧平、朱巧明、李培峰、錢培德,智能型漢字數碼輸入法技術的研究,中文信息學報,2006年第20卷第4期。
[5] 賴亦傑,應用多詞及多詞性語言模型的中文斷詞及詞性標記方式,國立中興大學資訊科學與工程學系,碩士學位論文,2011。
[6] W. Naptali, M. Tsuchiya, and S. Nakagawa, Topic-Dependent Language Model with Voting on Noun History, ACM Transactions on Asian Language Information Processing, Vol. 9, No. 2, Article 7, 2010.
[7] 王云凱、王萍,基于自然語言處理模型的多音字對漢語拼音字母排序的影響研究,西南民族大學學報(自然科學版),2012年第38卷第3期。
[8] 袁里馳,融合語言知識的統計句法分析,中南大學學報(自然科學版),2012年第43卷第3期。
[9] 陳林、楊丹,獨立于語種的文本分類方法,計算機工程與科學,2008年第30卷第6期。
[10] 郭雷,統計語言模型分析,軟體導刊,2011年第10卷第11期。
[11] P. H. Algoet and T. M. Cover, A Sandwich Proof of the Shannon-McMillan-Breiman Theorem, The Annals of Probability, Vol. 16, No. 2, pp. 899-909, 1988.
[12] D. Jurafsky and J. H. Martin, Speech and Language Processing (2nd Edition), Prentice Hall, Chapter 6, 2008.
[13] 袁毓林,基于統計的語言處理模型的局限性,語言文字應用,2004年第2期。
[14] H. Jeffreys, Theory of Probability, Second Edition, Clarendon Press, Oxford, 1948.
[15] I. J. Good, The Population Frequencies of Species and the Estimation of Population Parameters, Biometrika, Vol. 40, pp. 237-264, 1953.
[16] S. F. Chen and J. Goodman, An Empirical Study of Smoothing Techniques for Language Modeling, Computer Speech and Language, Vol. 13, pp. 359-394, 1999.
[17] F. Jelinek, Statistical Methods for Speech Recognition, The MIT Press, Cambridge, Massachusetts, 1997.
[18] A. Nadas, On Turing's Formula for Word Probabilities, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-33, pp. 1414-1416, 1985.
[19] W. A. Gale and G. Sampson, Good-Turing Frequency Estimation without Tears, Journal of Quantitative Linguistics, 2(3): 15-19, 1995.
[20] H. Ney and U. Essen, On Smoothing Techniques for Bigram-Based Natural Language Modeling, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 825-828, 1991.
[21] Chinese Gigaword語料庫第三版,http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T38
[22] 中研院線上斷詞器,http://ckipsvr.iis.sinica.edu.tw/
[23] 平衡語料庫簡介,http://db1x.sinica.edu.tw/cgi-bin/kiwi/mkiwi/mkiwi.sh
[24] 中央研究院資訊所、語言所詞庫小組技術報告第95-02/98-04號,「中央研究院漢語料庫的內容與說明」。
[25] S. M. Katz, Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-35, pp. 400-401, March 1987.
Abstract: Smoothing is a fundamental and important topic in natural language processing; applications such as speech recognition, machine translation, input methods, and even Traditional–Simplified Chinese conversion all depend on it. Smoothing addresses the data-sparseness problem that statistical language models face in practice by assigning a probability estimate to every event, including unseen ones. In this thesis, we first discuss the cross-entropy and perplexity of smoothing methods. Because of data sparseness, smoothing methods are employed to estimate the probability of each event in a language model. We cover several well-known smoothing methods: the Additive Discount method, the Good-Turing method, and the Witten-Bell method. Although existing smoothing techniques solve the data-sparseness problem effectively, they do not further analyze whether the resulting frequency distribution of observed events is reasonable. We therefore analyze smoothing from a statistical point of view and propose a set of properties that characterize the statistical behavior of these methods. Furthermore, we present two new smoothing methods that satisfy nearly all of the proposed properties. Finally, we build language models from a large Mandarin corpus, discuss how to evaluate them using cross-entropy and perplexity, and examine the cut-off issue proposed by Katz.
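The abstract names two of the standard techniques it analyzes (additive discounting and Good-Turing count re-estimation) and the cross-entropy/perplexity evaluation applied to them. As background, here is a minimal Python sketch of those textbook methods on a bigram model. It illustrates only the standard techniques, not the two new smoothing methods the thesis proposes; the toy corpus, the function names, and the discount value `delta = 0.5` are illustrative assumptions.

```python
from collections import Counter
import math

def train_bigram_counts(tokens):
    """Count unigram and bigram occurrences in a token sequence."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def additive_prob(w1, w2, unigrams, bigrams, vocab_size, delta=0.5):
    """Additive-discount (add-delta) bigram estimate:
    P(w2 | w1) = (c(w1, w2) + delta) / (c(w1) + delta * V)."""
    return (bigrams[(w1, w2)] + delta) / (unigrams[w1] + delta * vocab_size)

def good_turing_counts(counts):
    """Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r, where
    N_r is the number of event types seen exactly r times.  This bare
    version leaves r unchanged when N_{r+1} = 0 (no smoothing of the
    N_r curve, as a full implementation such as [19] would do)."""
    n = Counter(counts.values())
    return {e: ((r + 1) * n[r + 1] / n[r]) if n[r + 1] else r
            for e, r in counts.items()}

def cross_entropy(tokens, unigrams, bigrams, vocab_size, delta=0.5):
    """Average negative log2 probability per bigram; perplexity = 2**H."""
    logps = [math.log2(additive_prob(w1, w2, unigrams, bigrams,
                                     vocab_size, delta))
             for w1, w2 in zip(tokens, tokens[1:])]
    return -sum(logps) / len(logps)

train = "the cat sat on the mat the cat ate".split()
uni, bi = train_bigram_counts(train)
V = len(uni)  # vocabulary size

# Evaluate on held-out text: lower perplexity = better model fit.
H = cross_entropy("the cat sat".split(), uni, bi, V)
ppl = 2 ** H

# Additive smoothing keeps a proper distribution over the vocabulary.
total = sum(additive_prob("the", w, uni, bi, V) for w in uni)

# Good-Turing discounts singleton counts below 1, freeing mass for
# unseen events.
gt = good_turing_counts(uni)
```

Note how the two estimators embody the same idea differently: additive discounting shifts a fixed `delta` of pseudo-count to every event, while Good-Turing rescales each count `r` by how populated the neighboring frequency classes are.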
URI: http://hdl.handle.net/11455/19866
Other identifier: U0005-1508201215490600
Article link: http://www.airitilibrary.com/Publication/alDetailedMesh1?DocID=U0005-1508201215490600
Appears in Collections: 資訊網路與多媒體研究所

Files in This Item:

To obtain the full text, please visit the Airiti Library (華藝線上圖書館).



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.