Title: A Chinese Unknown Words Extraction Model for The Blog Connect
應用於Blog Connect的中文未知詞擷取模型
Author: 黃政傑
Jeng Jie Huang
Keywords: Unknown word
Chinese segmentation
Queried keyword
Abstract: Because written Chinese has no blanks marking word boundaries, the main goal of Chinese word segmentation is to identify words. One of its major problems is the handling of unknown (out-of-vocabulary) words. Conventional unknown word extraction methods are designed for complete sentences, typically taking an article as input and processing it sentence by sentence. Based on our observation, however, the queried keywords that Blog Connect collects when users search for blog articles are mostly incomplete sentences. We therefore propose a Chinese unknown word extraction model for Blog Connect that extracts unknown words from queried keywords and thereby improves their segmentation accuracy. We use a two-phase approach: unknown word detection followed by unknown word extraction. In the detection phase, we use characteristics of the queried-keyword set, together with keyword frequency and probability, to establish rules that determine whether a queried keyword may contain unknown words. In the extraction phase, we propose a variant of the bottom-up merging algorithm, combined with processing rules, that extracts unknown words recursively. The experimental results show that our method improves Chinese word segmentation for queried keywords, reaching an F-measure of 76.75%. It also outperforms CKIP, a segmentation system with unknown word identification: of the 988 unknown words in our experimental data, our method extracts 689, whereas CKIP extracts only 573.
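The two-phase idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state the thesis's actual detection rules or merging rules, so everything below is an assumption made for illustration: the detection rule (flag a keyword if any token is missing from the lexicon), the merge criterion (most frequent adjacent pair that involves an out-of-lexicon token, similar in spirit to byte-pair merging), the `min_count` threshold, and all function names. The loop is written iteratively rather than recursively.

```python
from collections import Counter


def may_contain_unknown(tokens, lexicon):
    """Phase 1 (assumed rule): flag a keyword if any token is not in the lexicon."""
    return any(tok not in lexicon for tok in tokens)


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def extract_unknown_words(keywords, lexicon, min_count=2):
    """Phase 2 (assumed rule): repeatedly merge the most frequent adjacent pair
    that involves an out-of-lexicon token, until no pair reaches `min_count`."""
    flagged = [list(k) for k in keywords if may_contain_unknown(k, lexicon)]
    while True:
        counts = Counter()
        for tokens in flagged:
            counts.update(zip(tokens, tokens[1:]))
        candidates = [
            (count, pair)
            for pair, count in counts.items()
            if count >= min_count
            and (pair[0] not in lexicon or pair[1] not in lexicon)
        ]
        if not candidates:
            break
        _, best = max(candidates)
        flagged = [merge_pair(tokens, best) for tokens in flagged]
    # Assumption: leftover single characters are not reported as unknown words.
    return {
        tok
        for tokens in flagged
        for tok in tokens
        if tok not in lexicon and len(tok) > 1
    }


# Toy data (hypothetical): pre-segmented queried keywords and a small lexicon.
lexicon = {"台北", "美食", "部落格"}
keywords = [["台北", "捷", "運"], ["捷", "運", "美食"], ["台北", "美食"]]
print(extract_unknown_words(keywords, lexicon))  # → {'捷運'}
```

In this toy run the third keyword is never flagged, and the pair (捷, 運) is merged because it occurs twice across the flagged keywords. A real system would replace both assumed rules with the frequency- and probability-based rules the thesis defines.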
Other identifiers: U0005-2811201416175782
Public release date: 2016-08-31
Appears in Collections: 資訊管理學系 (Department of Information Management)


