Please use this identifier to cite or link to this item:
標題: 分散式環境下基於序列型樣之 KNN 查詢
Distributed KNN Query Processing for Sequences with Similar Patterns
作者: 萬年松
Wan, Nian-Song
關鍵字: 分散式資料庫;Distributed Databases;漸進式;二分探測;序列型樣探勘;KNN 查詢;Incremental;Binary Probing;Sequence Pattern Mining;KNN Query
出版社: 電機工程學系所
引用: [1] Shuang Bai and Si-Xue Bai. The maximal frequent pattern mining of dna sequence. In Proceedings of IEEE International Conference on Granular Computing, pages 23–26, Aug. 2009. [2] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 330–339, 2007. [3] Jiong Yang and Meng Hu. Trajpattern: Mining sequential patterns from imprecise trajectories of mobile objects. In Proceedings of the 10th International Conference on Advances in Database Technology, pages 664–681, 2006. [4] CISCO. Cisco visual networking index: Global mobile data traffic forecast update, 2011 to 2016. Technical report, CISCO, Feb. 2012. [5] Ming-Yen Lin and Suh-Yin Lee. Improving the efficiency of interactive sequential pattern mining by incremental pattern discovery. In Proceedings of the 36th Annual Hawaii International Conference on System Sciences, page 68b, Jan. 2003. [6] Darya Chudova and Padhraic Smyth. Pattern discovery in sequences under a markov assumption. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 153–162, 2002. [7] P. A. Vijaya, M. Narasimha Murty, and D. K. Subramanian. An efficient technique for protein sequence clustering and classification. In Proceedings of the 17th International Conference on Pattern Recognition, pages 447–450, Washington, DC, USA, 2004. IEEE Computer Society. [8] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, September 1999. [9] Charu C. Aggarwal and Philip S. Yu. Outlier detection for high dimensional data. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 37–46. ACM, 2001. [10] Mi-Yen Yeh, Kun-Lung Wu, Philip S. Yu, and Ming-Syan Chen. Leewave: Level-wise distribution of wavelet coefficients for processing knn queries over distributed streams. Proceedings of the VLDB Endowment, 1(1):586–597, Aug. 2008. [11] Elke Achtert, Christian B‥ohm, Peer Kr‥oger, Peter Kunath, Alexey Pryakhin, and Matthias Renz. Efficient reverse k-nearest neighbor search in arbitrary metric spaces. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 515–526. ACM, 2006. [12] Xiaoyan Liu and Hakan Ferhatosmanoglu. Efficient k-nn search on streaming data series. In SSTD, volume 2750, pages 83–101. Springer, 2003. [13] Kun-Lung Wu, Shyh-Kwei Chen, and Philip S. Yu. Incremental processing of continual range queries over moving objects. IEEE Trans. on Knowledge and Data Engineering, 18(11):1560–1575, November 2006. [14] George Beskales, Mohamed A. Soliman, and Ihab F. IIyas. Efficient search for the top-k probable nearest neighbors in uncertain databases. VLDB Endowment, 1(1):326–339, August 2008. [15] Dana Ron, Yoram Singer, and Naftali Tishby. Learning probabilistic automata with variable memory length. In Proceedings of the 7th Annual Conference on Computational learning theory, pages 35–46, New York, NY, USA, 1994. ACM. [16] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering, pages 3–14. IEEE Computer Society, 1995. [17] Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology, pages 3–17. Springer-Verlag, 1996. [18] Jian Pei, Jiawei Han, Behzad Mortazavi-asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, andMei chun Hsu. Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering, pages 215–224, Washington, DC, USA, 2001. IEEE Computer Society. [19] A. Mohapatra, P.M. Mishra, and S. Padhy. Motif search in dna sequences using generalized suffix tree. In Proceedings of the 10th International Conference on Information Technology, pages 100 –103, Dec. 2007. [20] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data. ACM SIGKDD Explorations Newsletter, 1(2):12–23, January 2000. [21] Ron Begleiter, Ran El-Yaniv, and Golan Yona. On prediction using variable order markov models. Artificial Intelligence Research, 22(1):385–421, December 2004. [22] C.C. Hung, C.W. Chang, and W.C. Peng. Mining trajectory profiles for discovering user communities. In Proceedings of The International Workshop on Location Based Social Networks, pages 1–8. ACM, 2009. [23] Chunxi Chen and Bertil Schmidt. Parallel construction of large suffix trees on a pc cluster. In Proceedings of the 11th International Euro-Par Conference on Parallel Processing, pages 1227–1236, 2005. [24] Y.Y. Li, M. Thomason, and L.E. Parker. Detecting time-related changes in wireless sensor networks using symbol compression and probabilistic suffix trees. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2946–2951. IEEE, 2010. [25] G. Bejerano and G. Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001. [26] Alberto Apostolico and Gill Bejerano. Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. In Proceedings of the 4th Annual International Conference on Computational Molecular Biology, pages 25–32. ACM, 2000. [27] J. Ziv. On finite memory universal data compression and classification of individual sequences. IEEE Trans. on Information Theory, 54(4):1626–1636, April 2008. [28] Dimitrios Katsaros and Yannis Manolopoulos. A suffix tree based prediction scheme for pervasive computing environments. In Proceedings of the 10th Panhellenic Conference on Advances in Informatics, pages 267–277. SpringerVerlag, 2005. [29] K. Yildiz and M. Sert. Predicting motifs in human and mouse genes by using probabilistic suffix trees. In Proceedings of the 15th Nitional Biomedical Engineering Meeting, pages 1–4. IEEE, 2010. [30] Gill Bejerano and Golan Yona. Modeling protein families using probabilistic suffix trees. In Proceedings of the third annual international conference on Computational molecular biology, RECOMB ’99, pages 15–24, New York, NY, USA, 1999. ACM. [31] Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen. Mining group movement patterns for tracking moving objects efficiently. IEEE Trans. on Knowledge and Data Engineering, 23:266–281, 2011. [32] D. Katsaros and Y. Manolopoulos. Prediction in wireless networks by markov chains. Wireless Communications, 16(2):56–64, 2009. [33] Yinian Qi and Mikhail J. Atallah. Efficient privacy-preserving k-nearest neighbor search. In Proceedings of The 28th International Conference on Distributed Computing Systems, pages 311–319. IEEE Computer Society, 2008. [34] Li Xiong, Subramanyam Chitti, and Ling Liu. Mining multiple private databases using a knn classifier. In Proceedings of the ACM Symposium on Applied Computing, pages 435–440. ACM, 2007. [35] M. Shaneck, Y. Kim, and V. Kumar. Privacy preserving nearest neighbor search. In Proceedings of the ICDM Workshops, pages 541–545. IEEE, 2006. [36] Apostolos N. Papadopoulos and Yannis Manolopoulos. Distributed processing of similarity queries. Distributed and Parallel Databases, 9(1):67–92, January 2001. [37] K.P. Chan and A.W.C. Fu. Efficient time series matching by wavelets. In Proceedings of 15th International Conference on Data Engineering, pages 126–133. IEEE, 1999. [38] Christos Faloutsos, M. Ranganathan, and Yannis Manolopoulos. Fast subsequence matching in time-series databases. In Proceedings of The ACM SIGMOD International Conference on Management of Data, pages 419–429. ACM, 1994. [39] Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, and Eamonn Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. VLDB Endowment, 1(2):1542–1552, August 2008. [40] J. Yang and W. Wang. Cluseq: Efficient and effective sequence clustering. In Proceedings of the 19th International Conference on Data Engineering, pages 101–112. IEEE, 2003. [41] Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen. Exploring applicationlevel semantics for data compression. IEEE Trans. on Knowledge and Data Engineering, 23:95–109, 2011.
序列資料在我們的日常生活中隨處可見,像是物件移動軌跡、生物基因序列、金融或購買商品交易記錄等等,涵蓋的範圍極廣,具有極高的研究價值。序列型樣分析 (Sequence Pattern Analysis) 是資料探勘最重要的研究領域之一,主要是要找出隱藏在序列資料中的特徵 (型樣),例如 DNA 或蛋白質序列的 Motif、物件移動軌跡的規律性或瀏覽網頁的習慣性等等。序列資料的特徵比原始序列更容易理解且更具代表性,經常用於資料的檢索和查詢。近來,隨著生物晶片、感測和無線網路等相關技術的進步,大量序列資料快速累積。序列資料有著大量且分散的特性,傳統以單一伺服器儲存和處理的方式已無法負荷,因此以分散式資料庫來儲存序列資料是必然的趨勢。

面對龐大且分散的資料,如何進行有效率的查詢是一個重要且困難的課題。本論文的研究主題主要是在分散式環境中基於序列型樣進行比對的 KNN 查詢問題。為解決此問題,我們提出分散式漸進二元探測 (Distributed and Incremental KNN query with Binary Probing, DIKNN-BP) 演算法,其以 Probabilistic Suffix Tree (PST) 建構演算法探勘序列資料的型樣,並以漸進式的方式傳送查詢目標的型樣給分散的資料庫,各資料庫根據已接收的型樣進行相似度的上下限的估計,藉以排除不合格候選物件,最後回傳與目標物件最相似的 K 個物件。此外,我們根據所的提出的七個定理設計二元探測法,以加速上下限的收斂和KNN的查詢。根據實驗結果顯示 DIKNN-BP 演算法可以有效節省傳送型樣數及資料傳輸量,比起傳統的查詢方式更能節省查詢時間。

Sequence data are ubiquitous in our daily life, such as i.e., moving object trajectories、biological gene sequences、records of the commodity purchasing, which are of high research value. Sequence Pattern Analysis is one of the most important research fields in data mining. Its goal is to discover important features (patterns) hidden in sequence data, such as movement regularity in trajectories, motif in DNA or protein sequences, or purchasing behavior in transaction sequences. Recently, with the advance of DNA microarray chips, sensing, and wireless techniques, the huge amounts of sequence data are rapidly accumulated. Since a centralized server cannot afford the huge and ubiquitous sequence data, a distributed approach is the natural trend.

Facing the huge and distributed sequence data, how to query data efficiently is an important and difficulty problem. In the thesis, we focus on the problem of distributed KNN query of sequences with similar patterns. To solve this problem, we propose a distributed and incremental KNN query with binary probing (DIKNN-BP) algorithm. It mines sequence patterns by using Probabilistic Suffix Tree(PST)and transmits pattern to remote servers in a progressive manner. The remote servers estimate the upper and lower bound of similarity value between target object and remote candidate objects and based on which to prune unqualified candidates. In addition, we derive five theorems and propose a binary probing approach to speed up the converge of upper and lower bound, also the KNN query. According to the experimental results, our DIKNN-BP algorithm can obtain the KNN solution with fewer patterns transmitted and with less data transmitted. Compared with a naive approach, it achieves a shorter query execution time.
其他識別: U0005-2408201218402100
Appears in Collections:電機工程學系所

Show full item record

Google ScholarTM


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.