Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/19537
Title: Mining Frequent Itemsets Using Combinatorial Approximation and Itemset Clustering (利用組合式估算與項目集分群的高頻項目集挖掘法)
Author: CHANG, MING-YUAN (張銘元)
Keywords: support approximation
clustering
data mining
combinatorial approximation
frequent itemset
Publisher: 資訊科學與工程學系所 (Department of Computer Science and Engineering)
Abstract: Data mining techniques can quickly uncover important information in a database, such as frequent itemsets and association rules. However, as data accumulates over the years, a database grows very large and mining becomes extremely time-consuming; in particular, when the minimum support is lowered, mining frequent itemsets requires a long wait. Developing a fast frequent-itemset mining method that does not demand much memory is therefore an important problem. This dissertation proposes a combinatorial approximation method for computing itemset supports: using the Principle of Inclusion and Exclusion, itemset supports can be estimated quickly and frequent itemsets generated. Because the method does not need to verify itemset supports against the database, it effectively improves mining efficiency. We further propose an itemset clustering technique to raise the accuracy of the approximation. This technique excludes itemsets with low correlation, groups highly correlated itemsets into the same cluster, and then performs the approximation within each cluster, thereby reducing the estimation error. In addition, the dissertation presents a special storage structure so that clustering and searching require little extra work; compared with other algorithms, the proposed method achieves very stable execution efficiency. Experimental results show that, by properly tuning the similarity threshold used to cluster itemsets, the method reaches 99% accuracy. With its advantages in both speed and accuracy, the method lets users mine frequent itemsets quickly and accurately.
Association rule mining (ARM) usually requires a long running time to discover frequent itemsets from databases with a huge number of transactions, especially under low minimum supports. To speed up this task, we propose a new concept based on support approximation, using the Principle of Inclusion and Exclusion, to generate frequent itemsets. With this concept, frequent itemsets can be discovered by calculating their approximated supports, without verifying those supports against the database. To accomplish this, we combine a new clustering technique with support approximation and propose a novel mining method, CAC, to discover frequent itemsets. The clustering technique groups highly similar members to improve the accuracy of support approximation, and with a special data structure, the SC table, it incurs little additional cost. The hit-ratio analysis and experimental results presented in this dissertation show that CAC reaches 99% accuracy on average as the similarity threshold increases. Because it neither repeatedly scans the database nor stores vast amounts of information in memory, the CAC method mines frequent itemsets with relatively stable performance. Its advantages in both accuracy and performance make CAC an effective and useful technique for discovering frequent itemsets in a huge database.
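
The abstract describes approximating itemset supports with the Principle of Inclusion and Exclusion instead of verifying them against the database. As a rough illustration of that general idea (not the dissertation's CAC procedure, whose details are not given on this page), the Python sketch below derives lower and upper bounds on the support of an itemset from the known supports of its proper subsets and takes the midpoint as the approximation. The function names, the supports dictionary, the toy counts, and the midpoint heuristic are all assumptions made purely for illustration.

from itertools import combinations


def subset_bounds(itemset, supports):
    """Return (lower, upper) bounds on sup(itemset), derived from the supports
    of its proper subsets via the Principle of Inclusion and Exclusion."""
    items = sorted(itemset)
    k = len(items)
    lower, upper = 0, float("inf")
    for x_size in range(k):                       # anchor subset X, |X| < |I|
        for x in combinations(items, x_size):
            x = frozenset(x)
            # delta_X(I) = sum over X <= J < I of (-1)^(|I\J|+1) * sup(J)
            delta = 0
            for j_size in range(x_size, k):
                for j in combinations(items, j_size):
                    j = frozenset(j)
                    if x <= j:
                        delta += (-1) ** (k - j_size + 1) * supports[j]
            if (k - x_size) % 2 == 1:             # |I \ X| odd  -> upper bound
                upper = min(upper, delta)
            else:                                 # |I \ X| even -> lower bound
                lower = max(lower, delta)
    return lower, upper


def approximate_support(itemset, supports):
    """Midpoint of the derived bounds, used here as the approximated support."""
    lo, hi = subset_bounds(itemset, supports)
    return (lo + hi) / 2


# Hypothetical counts from a 10-transaction database; frozenset() maps to the
# total transaction count so the empty set can take part in the bounds.
supports = {
    frozenset(): 10,
    frozenset("a"): 6, frozenset("b"): 7, frozenset("c"): 5,
    frozenset("ab"): 5, frozenset("ac"): 4, frozenset("bc"): 4,
}
print(subset_bounds(frozenset("abc"), supports))        # -> (3, 4)
print(approximate_support(frozenset("abc"), supports))  # -> 3.5

In this toy run the bounds already confine sup(abc) to a narrow interval, which is why approximation can replace a database scan when the subsets' supports are known.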
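The abstract also mentions grouping highly correlated itemsets before approximating, so that the estimation error stays small. The dissertation's similarity measure and the layout of the SC table are not described here, so the sketch below assumes a hypothetical Jaccard similarity over the itemsets' constituent items and a greedy single-pass assignment, purely to show the shape of threshold-based itemset clustering.

def jaccard(a, b):
    """Jaccard similarity between two itemsets (frozensets of items)."""
    return len(a & b) / len(a | b)


def cluster_itemsets(itemsets, threshold):
    """Greedily put each itemset into the first cluster whose representative
    (its first member) is at least `threshold`-similar; otherwise open a new cluster."""
    clusters = []                     # each cluster is a list of itemsets
    for s in itemsets:
        for cluster in clusters:
            if jaccard(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Toy candidates: raising the threshold yields tighter, more correlated clusters.
candidates = [frozenset("ab"), frozenset("abc"), frozenset("bc"), frozenset("de")]
for c in cluster_itemsets(candidates, threshold=0.5):
    print([sorted(s) for s in c])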
URI: http://hdl.handle.net/11455/19537
Other Identifiers: U0005-1707200812440500
Article Link: http://www.airitilibrary.com/Publication/alDetailedMesh1?DocID=U0005-1707200812440500
Appears in Collections: 資訊科學與工程學系所 (Department of Computer Science and Engineering)

Files in This Item:
Full text is available from the Airiti Library (華藝線上圖書館).
