Title: Document Overlapping Clustering Using Formal Concept Analysis
Author: 林于婷 (Yu-Ting Lin)
Keywords: Overlapping Clustering; Formal Concept Analysis; Clustering; Text document clustering (重疊式分群; 正規概念分析; 分群; 文件分群)
In recent years, the volume of information and data has grown and spread rapidly. Many studies try to find useful patterns or knowledge in this growing data. Text document clustering is one data mining technique that addresses this problem.
Text document clustering groups documents into several clusters based on the similarities among them. Most traditional clustering algorithms build disjoint clusters, but clusters should be allowed to overlap, because in the real world a document often belongs to two or more categories. For example, an article discussing the Apple Watch may be categorized under 3C (computers, communications, and consumer electronics), Fashion, or even Clothing and Shoes. Assigning it to all of these categories lets more internet users find the article.
In this paper, we propose an overlapping clustering algorithm based on Formal Concept Analysis, which allows an article to belong to two or more clusters. Because of the hierarchical structure of the formal concept lattice, an article can belong to more than one formal concept. By extracting suitable formal concepts and transforming them into conceptual vectors, we obtain an overlapping clustering result. Moreover, because our algorithm reduces the dimension of the vector space, it performs more efficiently than traditional clustering approaches based on the Vector Space Model.

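Abstract (translated from Chinese): In recent years, the volume of data on the internet has grown and information has spread ever faster. Many studies attempt to use data mining methods to find useful knowledge or rules in these data, and text document clustering is one such approach. By computing similarity, document clustering groups similar documents together into clusters. Most previous studies can only assign a document to a single cluster, but in real life a single document may well need to be assigned to two or more categories. For example, an article discussing the Apple Watch may need to be classified simultaneously under categories such as "3C", "Trends", "Fashion", or even "Apparel" before it has a chance of being viewed by more internet users. This study incorporates Formal Concept Analysis to convert the original data set into the hierarchical structure of a concept lattice, selects suitable formal concepts, converts them into conceptual vectors, and then performs clustering, so that a document need not belong to only one cluster. The proposed method can be combined with a variety of traditional clustering methods to achieve overlapping clustering. In addition, previous document clustering algorithms tend to require excessive computing space or to suffer from overly high vector dimensionality. The proposed method also improves on this, greatly reducing the computing space occupied by the two-dimensional vectors and making clustering more efficient.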
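To illustrate how Formal Concept Analysis can let one document fall into several groups, here is a minimal sketch. It is not the thesis's algorithm: the toy documents and terms, the brute-force enumeration, the selection rule in `select_concepts` (keep concepts with at least two documents and a non-empty term set), and the binary "conceptual vector" encoding are all assumptions made for illustration. A formal concept is a pair (extent, intent) in which the extent is exactly the set of documents sharing every term in the intent, and the intent is exactly the set of terms shared by every document in the extent.

```python
from itertools import combinations

# Hypothetical toy context: each document maps to the set of terms it contains.
context = {
    "d1": {"apple", "watch", "battery"},
    "d2": {"apple", "watch", "fashion"},
    "d3": {"fashion", "shoes"},
}


def common_terms(docs, context, all_terms):
    """Terms shared by every document in docs (all terms if docs is empty)."""
    shared = set(all_terms)
    for d in docs:
        shared &= context[d]
    return shared


def docs_having(terms, context):
    """Documents that contain every term in terms."""
    return {d for d, ts in context.items() if terms <= ts}


def formal_concepts(context):
    """Enumerate all formal concepts by brute force over document subsets.

    For any subset A of documents, (A'', A') is a formal concept, and every
    concept arises this way. This is exponential, so it only suits toy data.
    """
    all_terms = set().union(*context.values())
    docs = sorted(context)
    concepts = set()
    for r in range(len(docs) + 1):
        for subset in combinations(docs, r):
            intent = common_terms(subset, context, all_terms)
            extent = docs_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def select_concepts(concepts, min_docs=2):
    """Assumed selection rule: at least min_docs documents and a non-empty intent."""
    chosen = [c for c in concepts if len(c[0]) >= min_docs and c[1]]
    return sorted(chosen, key=lambda c: sorted(c[1]))


def conceptual_vectors(context, selected):
    """Assumed encoding: one binary membership entry per selected concept."""
    return {
        d: [1 if d in extent else 0 for extent, _ in selected]
        for d in sorted(context)
    }


concepts = formal_concepts(context)
selected = select_concepts(concepts)
vectors = conceptual_vectors(context, selected)

for extent, intent in selected:
    print(sorted(intent), "->", sorted(extent))
for doc, vec in vectors.items():
    print(doc, vec)
```

In this toy context, document `d2` appears in the extent of both selected concepts, which is the kind of overlap the abstract describes. The vectors have one entry per selected concept (two here) rather than one per term (five here), which hints at how a concept-based representation could be lower-dimensional than a term-based Vector Space Model.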
Other identifiers: U0005-2508201513394500
Rights: Authorization granted for browsing and printing of the electronic full text; publicly available from 2015-08-26.
Appears in Collections: Department of Information Management (資訊管理學系)

Files in This Item:
File: nchu-104-7102029029-1.pdf
Size: 1.82 MB
Format: Adobe PDF
Access: This file is only available in the university internal network

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.