Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/9047
標題: SeMiHBase: 運用雲端Hadoop與HBase於軌跡序列探勘系統之研發
SeMiHBase: Trajectory Sequence Mining System with Hadoop and HBase
作者: 吳柏翰
Wu, Bo-Han
關鍵字: 軌跡序列探勘;Cloud Computing;雲端計算;機率後置樹;Hadoop/MapReduce;Probabilistic Suffix Tree (PST)
出版社: 電機工程學系所
引用: [1] D. Ron, Y. Singer, and N. Tishby, "Learning probabilistic automata with variable memory length," COLT, 1994. [2] Paul Bieganski, John Ned1, John V. Cadis, Ernest E Retzel, “Generalized Suffix Trees for Biological Sequence Data: Applications and Implementation, ” System Sciences, 1994. Proceedings of the Twenty-Seventh Hawaii International Conference on , vol.5, no., pp.35-44, 4-7 Jan. 1994. [3] Brian D O''Connor, Barry Merriman, Stanley F Nelson, "SeqWare Query Engine: storing and searching sequence data in the cloud", BMC Bioinformatics 2010, 11(Suppl 12):S2 [4] Ming-Syan Chen; Jong Soo Park; Yu, P.S.; , "Efficient data mining for path traversal patterns," Knowledge and Data Engineering, IEEE Transactions on , vol.10, no.2, pp.209-221, Mar/Apr 1998. [5] Demiriz, A.; , "webSPADE: a parallel sequence mining algorithm to analyze web log data," Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on , vol., no., pp. 755- 758, 2002. [6] Jo Ting, Tak-chung Fu, and Fu-lai Chung. "Mining of Stock Data: Intra- and Inter-Stock Pattern Associative Classification", In: Proceedings of 2006 International Conference on Data Mining, Las Vegas, USA, June 2006, pp. 30–36 (2006) [7] Daxin Jiang; Chun Tang; Aidong Zhang; , "Cluster analysis for gene expression data: a survey," Knowledge and Data Engineering, IEEE Transactions on , vol.16, no.11, pp. 1370- 1386, Nov. 2004. [8] Srivatsan Laxman; Sastry, P.S.; Unnikrishnan, K.P.; , "Discovering frequent episodes and learning hidden Markov models: a formal connection," Knowledge and Data Engineering, IEEE Transactions on , vol.17, no.11, pp. 1505- 1517, Nov. 2005. [9] Singh, S.; Donat, W.; Pattipati, K.; Willett, P.; Haiying Tu; , "Anomaly Detection via Feature-Aided Tracking and Hidden Markov Models," Aerospace Conference, 2007 IEEE , vol., no., pp.1-18, 3-10 March 2007. [10] Rao Kosaraju, S.; , "Faster algorithms for the construction of parameterized suffix trees," Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on , vol., no., pp.631-638, 23-25 Oct 1995. [11] Jingyi Yang; Deogun, J.; , "G Protein-Coupled Receptor Classification at the Subfamily Level with Probabilistic Suffix Tree," Computational Intelligence and Bioinformatics and Computational Biology, 2006. CIBCB ''06. 2006 IEEE Symposium on , vol., no., pp.1-8, 28-29 Sept. 2006. [12] Yildiz, K.; Sert, M.; , "Predicting motifs in human and mouse genes by using Probabilistic Suffix Trees," Biomedical Engineering Meeting (BIYOMUT), 2010 15th National , vol., no., pp.1-4, 21-24 April 2010. [13] YuanYuan Li; Thomason, M.; Parker, L.E.; , "Detecting time-related changes in Wireless Sensor Networks using symbol compression and Probabilistic Suffix Trees," Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, vol., no., pp.2946-2951, 18-22 Oct. 2010. [14] Mcintosh, T.; Chawla, S.; , "High Confidence Rule Mining for Microarray Analysis," Computational Biology and Bioinformatics, IEEE/ACM Transactions on , vol.4, no.4, pp.611-623, Oct.-Dec. 2007. [15] Douzono, H.; Tokushima, H.; Hara, S.; Noguchi, Y.; , "A design method of DNA chips using hierarchical self-organizing maps," Neural Information Processing, 2002. ICONIP ''02. Proceedings of the 9th International Conference on , vol.5, no., pp. 2244- 2248 vol.5, 18-22 Nov. 2002. [16] Bogorny, V.; Shekhar, S.; , "Spatial and Spatio-temporal Data Mining," Data Mining (ICDM), 2010 IEEE 10th International Conference on , vol., no., pp.1217, 13-17 Dec. 2010. [17] 巨型資料的超級挑戰來臨了, http://www.ithome.com.tw/itadm/article.php?c=68277&s=2, 1 August. 2013 [18] Facebook談Hadoop,Hive,HBase和A/B測試, http://www.infoq.com/cn/news/2010/07/facebook-hadoop-summit, 1 August 2013 [19] Sato, H.; Nanri, T.; Shimasaki, M.; , "Design and implementation of PVM-based portable distributed shared memory system on the workstation cluster environment," Parallel and Distributed Systems, 1997. Proceedings., 1997 International Conference on , vol., no., pp.578-583, 10-13 Dec 1997. [20] Richard, G.C., III; Singhal, M.; , "Using logging and asynchronous checkpointing to implement recoverable distributed shared memory," Reliable Distributed Systems, 1993. Proceedings., 12th Symposium on , vol., no., pp.58-67, 6-8 Oct 1993. [21] Chen, D.; Bu-Sung Lee; Wentong Cai; Turner, S.J.; , "Design and development of a cluster gateway for cluster-based HLA distributed virtual simulation environments," Simulation Symposium, 2003. 36th Annual , vol., no., pp. 193- 200, 30 March-2 April 2003. [22] Patrick, D.; Green, P.; York, T.; , "A distributed genetic algorithm environment for UNIX workstation clusters," Genetic Algorithms in Engineering Systems: Innovations and Applications, 1997. GALESIA 97. Second International Conference On (Conf. Publ. No. 446) , vol., no., pp.69-74, 2-4 Sep 1997. [23] Farach, M.; Ferragina, P.; Muthukrishnan, S.; , "Overcoming the memory bottleneck in suffix tree construction," Foundations of Computer Science, 1998. Proceedings.39th Annual Symposium on , vol., no., pp.174-183, 8-11 Nov 1998. [24] Hae-Won Choi; Hyun-Sung Kim; , "The Suffix Tree Construction for Large Sequences over Hand-held Device," Information Science and Security, 2008. ICISS. International Conference on , vol., no., pp.192-196, 10-12 Jan. 2008. [25] Apache Hadoop, http://hadoop.apache.org/, 1 August. 2013 [26] Apache HBase, http://hbase.apache.org/, 1 August. 2013 [27] Apache Thrift, http://thrift.apache.org/, 1 August. 2013 [28] F. Gao and M. J. Zaki. Psist: A scalable approach to indexing protein structures using suffix trees. 68(1):5463, January 2008. [29] C. Chen and B. Schmidt. Parallel construction of large suffix trees on a pc cluster. 3648:643{643, 2005. [30] M. Clark. Post congress tristesse. In TeX90 Conference Proceedings, pages 84{89. TeX Users Group, March 1991. [31] H.-P. Tsai, D.-N. Yang, M.-S. Chen,, "Exploring Group Moving Pattern for an Energy-Constrained Object Tracking Sensor Network," PAKDD, 2007. [32] H.-P. Tsai, D.-N. Yang, M.-S. Chen, "Mining Group Relationship in an Object Tracking Sensor Network," TJCCT, 2007. [33]H.-P. Tsai, D.-N. Yang, M.-S. Chen, "Mining Group Movement Patterns for Tracking Moving Objects Efficiently," accepted by IEEE TKDE, 2009. [34] H.-P. Tsai, D.-N. Yang, M.-S. Chen, "Exploring Application Level Semantics for Data Compression," accepted by IEEE TKDE, 2009. [35] W.-C. Peng, M.-S. Chen, "Developing data allocation schemes by incremental mining of user moving patterns in a mobile computing system," IEEE TKDE, vol. 15, pp. 70-85, 2003. [36] R. Agrawal and R. Srikant, "Mining sequential patterns," ICDE, 1995. [37] M.-S. Chen, J. S. Park, P. S. Yu, "Efficient data mining for path traversal patterns,"Knowledge and Data Engineering, vol. 10, no. 2, pp. 209-221, 1998.
摘要: 
資料探勘 (Data Mining) 是利用各種分析工具擷取隱藏在大量資料中的知識,並且已廣泛地運用於許多領域,發展出創新和高附加價值的應用,其所運用的相關技術涵蓋了資料庫、統計、人工智慧、類神經網路、機器學習等。
序列 (sequence) 通常用來表示具有順序關係的資料,而序列資料無所不在,例如: 基因序列、蛋白質序列、移動軌跡、網頁的瀏覽紀錄、股票交易等。應用資料探勘的技術於序列資料的分析一直是熱門且重要的領域,其中序列型樣探勘 (Sequence pattern discovery) 是序列分析的技術之一,其目的是從序列資料中找出特殊、重要、具有代表性的特徵 (feature)。近來在生物資訊領域或時空軌跡資料分析等方面更吸引了大量的注意。由於Probabilistic Suffix Tree (PST)具有擷取結構特徵或轉換行為的能力,因此經常被用於序列型樣探勘。
隨著資料蒐集技術的進步,序列資料大量、快速累積,由於資料量非常龐大,使得傳統的單機或集中式的計算方法已經無法負荷。因此我們於雲端平台提出了一個CloudPST軌跡序列探勘系統,其整合Hadoop、HBase、Web Server以提供有效率的序列型樣探勘服務。具體來說,我們首先在Hadoop平台上提出一個平行分散式的PST建構演算法,稱為CloudPST。CloudPST符合MapReduce的框架,其權衡分散式計算的額外花費和避免無謂的計算,採漸進分層的策略建樹。其次,HBase是架構在HDFS上之雲端資料管理系統。為符合Hbase的特性,我們設計適合的key-value形式的資料格式,使軌跡資料的對應和存取更為容易有效率。最後,我們設計一Web-based使用者介面,讓使用者與雲端平台做溝通與互動,操作上傳軌跡資料、上傳事件資料、執行PST運算、查看PST運算狀態、進行預測等運算。
為進行效能驗證,我們實作兩個版本的CloudPST演算法,包括直接從檔案讀取輸入資料和從Hbase讀取資料。實驗結果顯示,由於HBase是架構在HDFS上,因此資料存取的成本比直接從HDFS讀取檔案較高,大約在1 GB的資料量時,直接從HDFS讀取檔案比從Hbase讀取資料快了10%。

Data mining is the use of many techniques, such as databases, statistics, artificial intelligence, neural networks, machine learning, to discover knowledge hidden in a large amount of data. Nowadays, data mining is emerging as a paradigm of data analysis and widely applied in many fields to develop innovative and value-added applications. Recently, sequence data are ubiquitous in our daily life and sequence data mining attracts a lot of attention especially in the domains of bioinformatics and spatio-temporal trajectory data analysis. To discover significant and representative features behind the data, Probabilistic Suffix Tree (PST) are frequently used due to their capability in capturing the structural characteristics and transition behavior in sequence data.
With the advance of data collection techniques, sequence data are accumulated rapidly and ubiquitously. However, traditional centralized PST learning algorithms do not scale well in learning huge sequence data. In view of this, we propose a CloudPST mining system that integrates Hadoop, HBase, Web Server to provide sequence data mining service. Specifically, in the system, we first propose a distributed and parallel PST building algorithm, named CloudPST, on Hadoop platform to speed up big data processing. CloudPST is a Map/Reduce framework that builds PSTs progressively to avoid learning excessive patterns and trade off the overhead of distributed computing. Second, HBase is database management system on top of HDFS. Conforming to the column-based property of HBase, we design suitable key-value form for easy of data mapping and retrieval. In addition, we design a Web-based interface between client ends and the cloud computing platform to provide users with data uploading, PST building, and mining status checking operations.
To evaluate the performance of our proposed system, we implement the CloudPST algorithm with input data from file and from Hbase. The experimental results show that CloudPST+File outperforms CloudPST+Hbase. The execution time is about 10% faster as the data size is about 1 GB.
URI: http://hdl.handle.net/11455/9047
其他識別: U0005-2808201316561800
Appears in Collections:電機工程學系所

Show full item record
 

Google ScholarTM

Check


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.