Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/19803
Title: 應用多詞及多詞性語言模型的中文斷詞及詞性標記方法
Applying nWord and nPOS Language Models to Word Segmentation and Part of Speech Tagging for Chinese
Author: 賴亦傑
Lai, Yi-Chieh
Keywords: Chinese word segmentation (中文斷詞)
part-of-speech tagging (詞性標記)
Chinese TreeBank (中文句結構樹)
nWord (多詞)
nPOS (多詞性)
Publisher: Institute of Networking and Multimedia (資訊網路多媒體研究所)
Abstract: Chinese word segmentation and part-of-speech (POS) tagging are fundamental tasks in speech and language processing. Many applications, such as speech recognition, online translation, and syntactic parsing, use or depend on segmentation results, so segmentation accuracy directly affects system performance. In this thesis we adopt a two-stage method for word segmentation and POS tagging. The first stage searches for the best segmentation, using nWords (multi-word units) to segment the text. The best result came from a mixed bigram probability model combined with 1Words and 2Words, which reached an F-score of 96.69% in the outside test. We also found that filtering the nWords with the Sinica Chinese TreeBank (v3.1) and then pairing the filtered nWords with a unigram probability model gives good results as well. The second stage tags POS on the words produced by the first stage. We experimented with a POS 4-gram (quadgram) model, an nPOS model, and an nWord-POS model, and found that nWord-POS training data combined with a POS trigram model gives good tagging performance. Finally, we make the Chinese word segmentation and POS tagging system available online for others to consult.
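The first-stage idea in the abstract, choosing the segmentation that scores best under a bigram word model, can be illustrated with a minimal sketch. This is not the thesis implementation: the toy counts, the interpolation weight `lam`, the add-one unigram smoothing, and the names `log_prob`, `segment`, and `f_score` are all assumptions made for illustration, and the record does not say what "mixed" means in the mixed bigram model. The sketch runs a Viterbi-style search over candidate words and computes a word-level F-score by matching character spans.

```python
import math

# Hypothetical toy counts for illustration only; not data from the thesis.
UNIGRAM = {"我": 5, "喜歡": 3, "喜": 1, "歡": 1, "中文": 4, "中": 2, "文": 2}
BIGRAM = {("<s>", "我"): 4, ("我", "喜歡"): 3, ("喜歡", "中文"): 2}
TOTAL = sum(UNIGRAM.values())
MAX_WORD_LEN = 4  # assumed upper bound on candidate word length


def log_prob(prev, word, lam=0.7):
    """Interpolated bigram/unigram log-probability (an assumed smoothing scheme)."""
    uni = (UNIGRAM.get(word, 0) + 1) / (TOTAL + len(UNIGRAM) + 1)
    if prev == "<s>":
        prev_count = sum(c for (p, _), c in BIGRAM.items() if p == "<s>")
    else:
        prev_count = UNIGRAM.get(prev, 0)
    bi = BIGRAM.get((prev, word), 0) / prev_count if prev_count else 0.0
    return math.log(lam * bi + (1 - lam) * uni)


def segment(text):
    """Return the highest-scoring word sequence for text under the toy model."""
    # best[i] maps the last word of a segmentation of text[:i]
    # to (score, (start of that word, previous word)).
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["<s>"] = (0.0, None)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if len(word) > 1 and word not in UNIGRAM:
                continue  # only lexicon entries may span several characters
            for prev, (score, _) in best[start].items():
                cand = score + log_prob(prev, word)
                if word not in best[end] or cand > best[end][word][0]:
                    best[end][word] = (cand, (start, prev))
    # Follow back-pointers from the best final word to the sentence start.
    word = max(best[-1], key=lambda w: best[-1][w][0])
    pos, words = len(text), []
    while word != "<s>":
        words.append(word)
        pos, word = best[pos][word][1]
    return words[::-1]


def f_score(gold, pred):
    """Word-level F-score: a predicted word is correct if its span matches gold."""
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out
    g, p = spans(gold), spans(pred)
    hit = len(g & p)
    if hit == 0:
        return 0.0
    precision, recall = hit / len(p), hit / len(g)
    return 2 * precision * recall / (precision + recall)


print(segment("我喜歡中文"))  # ['我', '喜歡', '中文']
```

Keying the search state on the last word, rather than only on the position, is what lets the bigram term depend on the preceding word; a unigram model would need only one score per position.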
URI: http://hdl.handle.net/11455/19803
Other identifiers: U0005-1608201112415500
Article link: http://www.airitilibrary.com/Publication/alDetailedMesh1?DocID=U0005-1608201112415500
Appears in Collections: Institute of Networking and Multimedia (資訊網路與多媒體研究所)

Files in this item:

For the full text, please visit Airiti Library (華藝線上圖書館).

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.