Title: A Chinese Unknown Words Extraction Model for The Blog Connect
應用於Blog Connect的中文未知詞擷取模型
Author: 黃政傑
Jeng Jie Huang
Keywords: Unknown word
Chinese segmentation
Queried keyword
Abstract: Because written Chinese, unlike Western languages, has no blanks marking word boundaries, word segmentation is an essential step in processing Chinese text, and one of its major problems is the handling of unknown (out-of-vocabulary) words. Conventional unknown word extraction methods are designed mainly for complete articles and take the sentence as the processing unit. However, the queried keywords that Blog Connect collects when users search for blog articles are mostly not complete sentences. We therefore propose a Chinese unknown word extraction model for Blog Connect that extracts unknown words from queried keywords and thereby improves the segmentation accuracy of those keywords. The model works in two phases: unknown word detection followed by unknown word extraction. In the detection phase, we use the characteristics of the queried keyword set, together with keyword frequency and probability, to establish rules that determine whether a queried keyword may contain unknown words. In the extraction phase, we propose a variant of the bottom-up merging algorithm, combined with processing rules, that extracts unknown words recursively. Experimental results show that our method improves Chinese word segmentation of queried keywords, reaching an F-measure of 76.75%. It also outperforms CKIP, a segmentation system with built-in unknown word identification: of the 988 unknown words in our experimental data, our method extracts 689, whereas CKIP extracts only 573.
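The abstract does not spell out the merging rules, so the following Python sketch only illustrates the general idea of bottom-up merging over a set of queried keywords. It assumes each keyword has already been segmented by a dictionary-based segmenter that leaves unknown words split into single characters. The function names, the `min_count` threshold, and the "one side must be a single character" rule are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Join every adjacent occurrence of `pair` in `tokens` into one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def extract_unknown_words(segmented_queries, dictionary, min_count=2):
    """Repeatedly merge the most frequent adjacent token pair across all queries.

    Assumed (hypothetical) rules: a pair is a merge candidate only if at least
    one side is a single character, and it must occur at least `min_count`
    times in the whole query set.
    """
    queries = [list(tokens) for tokens in segmented_queries]
    while True:
        pair_counts = Counter()
        for tokens in queries:
            for left, right in zip(tokens, tokens[1:]):
                if len(left) == 1 or len(right) == 1:
                    pair_counts[(left, right)] += 1
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < min_count:
            break
        queries = [merge_pair(tokens, pair) for tokens in queries]

    # Multi-character tokens that the dictionary does not know are the candidates.
    unknown = []
    for tokens in queries:
        for token in tokens:
            if len(token) > 1 and token not in dictionary and token not in unknown:
                unknown.append(token)
    return unknown
```

For example, with a dictionary containing only 文章 and 推薦, the queries `["部", "落", "格", "文章"]` and `["部", "落", "格", "推薦"]` would have their leftover single characters merged step by step into the candidate 部落格. A real system would need stricter rules than this frequency threshold to avoid merging across true word boundaries.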
Other identifier: U0005-2811201416175782
Publication date: 2016-08-31
