Title: Document Overlapping Clustering Using Formal Concept Analysis
Author: Yu-Ting Lin (林于婷)
Keywords: Overlapping Clustering; Formal Concept Analysis; Clustering; Text Document Clustering
In recent years, the volume of information and data has grown and spread rapidly, and many studies try to discover useful patterns or knowledge in this growing data. Text document clustering is one data mining technique for addressing this problem.
Text document clustering groups documents into clusters based on the similarities among them. Most traditional clustering algorithms build disjoint clusters, but in the real world a document often belongs to two or more categories, so clusters should be allowed to overlap. For example, an article discussing the Apple Watch may be categorized under 3C, Fashion, or even Clothing and Shoes, and placing it in all of these categories lets more internet users see it.
In this paper, we propose an overlapping clustering algorithm based on Formal Concept Analysis that allows an article to belong to two or more clusters. Because of the hierarchical structure of the Formal Concept Lattice, an article can belong to more than one Formal Concept. By extracting suitable Formal Concepts and transforming them into conceptual vectors, an overlapping clustering result can be obtained. Moreover, because our algorithm reduces the dimension of the vector space, it performs more efficiently than traditional clustering approaches based on the Vector Space Model.
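To make the idea of overlapping membership concrete, the following is a minimal sketch of how formal concepts can be enumerated from a small document-term context. It is not the algorithm proposed in this work, whose details the abstract does not give. The toy documents and terms, the brute-force enumeration, the `min_docs` selection rule, and the binary "conceptual vector" encoding are all assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical toy context: each document maps to the set of terms it contains.
context = {
    "d1": {"apple", "watch", "tech"},
    "d2": {"watch", "fashion"},
    "d3": {"apple", "tech"},
    "d4": {"fashion", "shoes"},
}


def common_terms(docs, context, all_terms):
    """Derivation operator: terms shared by every document in `docs`."""
    terms = set(all_terms)
    for d in docs:
        terms &= context[d]
    return frozenset(terms)


def docs_having(terms, context):
    """Derivation operator: documents that contain every term in `terms`."""
    return frozenset(d for d, t in context.items() if terms <= t)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by brute force.

    Exponential in the number of documents, so suitable for toy data only.
    """
    all_terms = set().union(*context.values())
    docs = sorted(context)
    concepts = set()
    for r in range(len(docs) + 1):
        for subset in combinations(docs, r):
            intent = common_terms(subset, context, all_terms)
            extent = docs_having(intent, context)  # closure of the subset
            concepts.add((extent, intent))
    return concepts


def select_concepts(concepts, min_docs=2):
    """Assumed selection rule: keep concepts with a non-empty intent
    that cover at least `min_docs` documents."""
    kept = [c for c in concepts if c[1] and len(c[0]) >= min_docs]
    return sorted(kept, key=lambda c: (sorted(c[0]), sorted(c[1])))


def concept_vectors(context, selected):
    """Assumed encoding: one binary dimension per selected concept."""
    return {d: [int(d in extent) for extent, _ in selected]
            for d in sorted(context)}


if __name__ == "__main__":
    selected = select_concepts(formal_concepts(context))
    for extent, intent in selected:
        print(sorted(extent), sorted(intent))
    print(concept_vectors(context, selected))
```

In this toy context, `d1` appears in the extents of both the {apple, tech} concept and the {watch} concept, so it falls into two groups at once. The resulting vectors also have one dimension per selected concept rather than one per term, which illustrates in a simplified way how a concept-based representation can be smaller than a term-based one.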

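In recent years, the volume of data on the Internet has grown and information has spread ever faster. Many studies have tried to use data mining methods to find useful knowledge or rules in this data, and text document clustering is one such approach. By computing similarity, text document clustering groups similar documents together into clusters. Most previous studies could assign a document to only a single cluster, but in real life a single document may well need to be assigned to two or more categories. For example, an article discussing the Apple Watch may need to be categorized simultaneously under "3C", "Trends", "Fashion", or even "Clothing" for it to have a chance of being viewed by more internet users. This study incorporates Formal Concept Analysis to transform the original data set into the hierarchical structure of a Concept Lattice, selects suitable formal concepts, converts them into conceptual vectors, and then performs clustering, so that a document is not limited to a single cluster. The proposed method can be combined with a variety of traditional clustering methods to produce overlapping clustering results. In addition, previous document clustering algorithms tend to require excessive computation space or suffer from overly high vector dimensionality; the proposed method also improves on this, greatly reducing the computation space occupied by the two-dimensional vectors and making clustering more efficient.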
