Please use this identifier to cite or link to this item:
標題: 利用一Hadoop MapReduce鏈於建構機率尾置樹之研究
Building a Probabilistic Suffix Tree Using a Hadoop MapReduce Chain
作者: 陳映廷
Ying Ting Chen
關鍵字: Trajectory data mining
Cloud computing
Probabilistic Suffix Tree
引用: ? 參考文獻 [1] D. Ron, Y. Singer, and N. Tishby, 'Learning probabilistic automata with variable memory length,' COLT, 1994. [2] Paul Bieganski, John Ned1, John V. Cadis, Ernest E Retzel, 'Generalized Suffix Trees for Biological Sequence Data: Applications and Implementation, ' System Sciences, 1994. Proceedings of the Twenty-Seventh Hawaii International Conference on , vol.5, no., pp.35-44, 4-7 Jan. 1994. [3] Brian D O'Connor, Barry Merriman, Stanley F Nelson, 'SeqWare Query Engine: storing and searching sequence data in the cloud', BMC Bioinformatics 2010, 11(Suppl 12):S2 [4] Ming-Syan Chen; Jong Soo Park; Yu, P.S.; , 'Efficient data mining for path traversal patterns,' Knowledge and Data Engineering, IEEE Transactions on , vol.10, no.2, pp.209-221, Mar/Apr 1998. [5] Demiriz, A.; , 'webSPADE: a parallel sequence mining algorithm to analyze web log data,' Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on , vol., no., pp. 755- 758, 2002. [6] Jo Ting, Tak-chung Fu, and Fu-lai Chung. 'Mining of Stock Data: Intra- and Inter-Stock Pattern Associative Classification', In: Proceedings of 2006 International Conference on Data Mining, Las Vegas, USA, June 2006, pp. 30–36 (2006) [7] Daxin Jiang; Chun Tang; Aidong Zhang; , 'Cluster analysis for gene expression data: a survey,' Knowledge and Data Engineering, IEEE Transactions on , vol.16, no.11, pp. 1370- 1386, Nov. 2004. [8] Srivatsan Laxman; Sastry, P.S.; Unnikrishnan, K.P.; , 'Discovering frequent episodes and learning hidden Markov models: a formal connection,' Knowledge and Data Engineering, IEEE Transactions on , vol.17, no.11, pp. 1505- 1517, Nov. 2005. [9] Singh, S.; Donat, W.; Pattipati, K.; Willett, P.; Haiying Tu; , 'Anomaly Detection via Feature-Aided Tracking and Hidden Markov Models,' Aerospace Conference, 2007 IEEE , vol., no., pp.1-18, 3-10 March 2007. [10] Rao Kosaraju, S.; , 'Faster algorithms for the construction of parameterized suffix trees,' Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on , vol., no., pp.631-638, 23-25 Oct 1995. [11] Jingyi Yang; Deogun, J.; , 'G Protein-Coupled Receptor Classification at the Subfamily Level with Probabilistic Suffix Tree,' Computational Intelligence and Bioinformatics and Computational Biology, 2006. CIBCB '06. 2006 IEEE Symposium on , vol., no., pp.1-8, 28-29 Sept. 2006. [12] Yildiz, K.; Sert, M.; , 'Predicting motifs in human and mouse genes by using Probabilistic Suffix Trees,' Biomedical Engineering Meeting (BIYOMUT), 2010 15th National , vol., no., pp.1-4, 21-24 April 2010. [13] YuanYuan Li; Thomason, M.; Parker, L.E.; , 'Detecting time-related changes in Wireless Sensor Networks using symbol compression and Probabilistic Suffix Trees,' Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, vol., no., pp.2946-2951, 18-22 Oct. 2010. [14] Mcintosh, T.; Chawla, S.; , 'High Confidence Rule Mining for Microarray Analysis,' Computational Biology and Bioinformatics, IEEE/ACM Transactions on , vol.4, no.4, pp.611-623, Oct.-Dec. 2007. [15] Douzono, H.; Tokushima, H.; Hara, S.; Noguchi, Y.; , 'A design method of DNA chips using hierarchical self-organizing maps,' Neural Information Processing, 2002. ICONIP '02. Proceedings of the 9th International Conference on , vol.5, no., pp. 2244- 2248 vol.5, 18-22 Nov. 2002. [16] Bogorny, V.; Shekhar, S.; , 'Spatial and Spatio-temporal Data Mining,' Data Mining (ICDM), 2010 IEEE 10th International Conference on , vol., no., pp.1217, 13-17 Dec. 2010. [17] 巨型資料的超級挑戰來臨了,, 1 August. 2013 [18] Facebook談Hadoop,Hive,HBase和A/B測試,, 1 August 2013 [19] Sato, H.; Nanri, T.; Shimasaki, M.; , 'Design and implementation of PVM-based portable distributed shared memory system on the workstation cluster environment,' Parallel and Distributed Systems, 1997. Proceedings., 1997 International Conference on , vol., no., pp.578-583, 10-13 Dec 1997. [20] Richard, G.C., III; Singhal, M.; , 'Using logging and asynchronous checkpointing to implement recoverable distributed shared memory,' Reliable Distributed Systems, 1993. Proceedings., 12th Symposium on , vol., no., pp.58-67, 6-8 Oct 1993. [21] Chen, D.; Bu-Sung Lee; Wentong Cai; Turner, S.J.; , 'Design and development of a cluster gateway for cluster-based HLA distributed virtual simulation environments,' Simulation Symposium, 2003. 36th Annual , vol., no., pp. 193- 200, 30 March-2 April 2003. [22] Patrick, D.; Green, P.; York, T.; , 'A distributed genetic algorithm environment for UNIX workstation clusters,' Genetic Algorithms in Engineering Systems: Innovations and Applications, 1997. GALESIA 97. Second International Conference On (Conf. Publ. No. 446) , vol., no., pp.69-74, 2-4 Sep 1997. [23] Farach, M.; Ferragina, P.; Muthukrishnan, S.; , 'Overcoming the memory bottleneck in suffix tree construction,' Foundations of Computer Science, 1998. Proceedings.39th Annual Symposium on , vol., no., pp.174-183, 8-11 Nov 1998. [24] Hae-Won Choi; Hyun-Sung Kim; , 'The Suffix Tree Construction for Large Sequences over Hand-held Device,' Information Science and Security, 2008. ICISS. International Conference on , vol., no., pp.192-196, 10-12 Jan. 2008. [25] Apache Hadoop,, 1 August. 2013 [26] Apache HBase,, 1 August. 2013 [27] Apache Thrift,, 1 August. 2013 [28] F. Gao and M. J. Zaki. Psist: A scalable approach to indexing protein structures using suffix trees. 68(1):5463, January 2008. [29] C. Chen and B. Schmidt. Parallel construction of large suffix trees on a pc cluster. 3648:643{643, 2005. [30] M. Clark. Post congress tristesse. In TeX90 Conference Proceedings, pages 84{89. TeX Users Group, March 1991. [31] Zhong, W.; Altum, G.; Harrison, R.; Tai, P.C.; Pan, Y.; , 'Mining protein sequence motifs representing common 3D structures,' Computational Systems Bioinformatics Conference, 2005. Workshops and Poster Abstracts. IEEE , vol., no., pp. 215- 216, 8-11 Aug. 2005. [32] Man Yat Lo; Lucas, S.M.; , 'Evolving Musical Sequences with N-Gram Based Trainable Fitness Functions,' Evolutionary Computation, 2006. CEC 2006. IEEE Congress on , vol., no., pp.601-608, 0-0 0. [33] Greenspan, H.; Goldberger, J.; Mayer, A.; , 'Probabilistic space-time video modeling via piecewise GMM,' Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.26, no.3, pp.384-396, March 2004. [34] Wang, Chong; Wang, Yanqing; , 'Discovering consumer's behavior changes based on purchase sequences,' Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on , vol., no., pp.642-645, 29-31 May 2012. [35] Dianhui Wang; Nung Kion Lee; Dillon, T.S.; , 'Data mining for building neural protein sequence classification systems with improved performance,' Neural Networks, 2003. Proceedings of the International Joint Conference on , vol.3, no., pp. 1746- 1751 vol.3, 20-24 July 2003. [36] de Souza Rodrigues, T.; Cardoso, F.C.; Ribeiro Teixeira, S.M.; Oliveira, S.C.; Braga, A.P.; , 'Protein Classification with Extended-Sequence Coding by Sliding Window,' Computational Biology and Bioinformatics, IEEE/ACM Transactions on , vol.8, no.6, pp.1721-1726, Nov.-Dec. 2011. [37] Jingke Xi; , 'Outlier Detection Algorithms in Data Mining,' Intelligent Information Technology Application, 2008. IITA '08. Second International Symposium on , vol.1, no., pp.94-97, 20-22 Dec. 2008. [38] Koufakou, A.; Secretan, J.; Reeder, J.; Cardona, K.; Georgiopoulos, M.; , 'Fast parallel outlier detection for categorical datasets using MapReduce,' Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on , vol., no., pp.3298-3304, 1-8 June 2008. [39] Zhipeng Cai; Lizhe Xu; Yi Shi; Salavatipour, M.R.; Goebel, R.; Guohui Lin; , 'Using Gene Clustering to Identify Discriminatory Genes with Higher Classification Accuracy,' BioInformatics and BioEngineering, 2006. BIBE 2006. Sixth IEEE Symposium on , vol., no., pp.235-242, 16-18 Oct. 2006. [40] Nagar, A.; Al-Mubaid, H.; , 'Using path length measure for gene clustering based on similarity of annotation terms,' Computers and Communications, 2008. ISCC 2008. IEEE Symposium on , vol., no., pp.637-642, 6-9 July 2008. [41] Chen, X.; Murphy, R.F.; , 'Location proteomics: determining the optimal grouping of proteins according to their subcellular location patterns as determined from fluorescence microscope images,' Signals, Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference on , vol.1, no., pp. 50- 54 Vol.1, 7-10 Nov. 2004. [42] Chih-Chieh Hung; Wen-Chih Peng; , 'A Storage Management for Mining Object Moving Patterns in Object Tracking Sensor Networks,' Web Intelligence and Intelligent Agent Technology Workshops, 2006. WI-IAT 2006 Workshops. 2006 IEEE/WIC/ACM International Conference on , vol., no., pp.252-258, Dec. 2006. [43] Gaol, F.L.; , 'Exploring the Pattern of Habits of Users Using Web log Squential Pattern,' Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on , vol., no., pp.161-163, 2-3 Dec. 2010. [44] Chih-Chin Liu; Jia-Lien Hsu; Chen, A.L.P.; , 'Efficient theme and non-trivial repeating pattern discovering in music databases,' Data Engineering, 1999. Proceedings., 15th International Conference on , vol., no., pp.14-21, 23-26 Mar 1999. [45] Agrawal, R. and R. Srikant. Mining sequential patterns. in Data Engineering, 1995. Proceedings of the Eleventh International Conference on. 1995. IEEE. [46] Han, J., et al. FreeSpan: frequent pattern-projected sequential pattern mining. in Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. 2000. ACM. [47] Zaki, M.J., SPADE: An efficient algorithm for mining frequent sequences. Machine learning, 2001. 42(1-2): p. 31-60. [48] A. Ghoting and K. Makarychev. Prospects and limitations of full-text index structures in genome analysis. Nucleic Acids Research, page 123, May 2012. [49] L. Lamport. LaTeX User's Guide and Document Reference Manual. Addison-Wesley Publishing Company, Reading, Massachusetts, 1986. [50] Qi Zheng; Shengyi Jiang; , 'Intrusion Detection Based on Variable Prefix of System Call,' Networks Security Wireless Communications and Trusted Computing (NSWCTC), 2010 Second International Conference on , vol.1, no., pp.398-401, 24-25 April 2010. [51] Koit, M.; , 'Automatic recognition of dialogue acts in complex typology,' Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on , vol., no., pp.485-489, 15-18 June 2011. [52] Po-Ruey Lei; Tsu-Jou Shen; Wen-Chih Peng; Ing-Jiunn Su; , 'Exploring Spatial-Temporal Trajectory Model for Location Prediction,' Mobile Data Management (MDM), 2011 12th IEEE International Conference on , vol.1, no., pp.58-67, 6-9 June 2011. [53] Koul, N.; Bui, N.; Honavar, V.; , 'Scalable, updatable predictive models for sequence data,' Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference on , vol., no., pp.681-685, 18-21 Dec. 2010. [54] H.-P. Tsai, D.-N. Yang, M.-S. Chen,, 'Exploring Group Moving Pattern for an Energy-Constrained Object Tracking Sensor Network,' PAKDD, 2007. [55] H.-P. Tsai, D.-N. Yang, M.-S. Chen, 'Mining Group Relationship in an Object Tracking Sensor Network,' TJCCT, 2007. [56]H.-P. Tsai, D.-N. Yang, M.-S. Chen, 'Mining Group Movement Patterns for Tracking Moving Objects Efficiently,' accepted by IEEE TKDE, 2009. [57] H.-P. Tsai, D.-N. Yang, M.-S. Chen, 'Exploring Application Level Semantics for Data Compression,' accepted by IEEE TKDE, 2009. [58] W.-C. Peng, M.-S. Chen, 'Developing data allocation schemes by incremental mining of user moving patterns in a mobile computing system,' IEEE TKDE, vol. 15, pp. 70-85, 2003. [59] R. Agrawal and R. Srikant, 'Mining sequential patterns,' ICDE, 1995. [60] M.-S. Chen, J. S. Park, P. S. Yu, 'Efficient data mining for path traversal patterns,'Knowledge and Data Engineering, vol. 10, no. 2, pp. 209-221, 1998.
摘要: 序列資料(Sequence Data)是指具有順序關係的資料紀錄,序列資料在日常生活中是無處不在的,例如:蛋白質序列、股市交易、網頁瀏覽紀錄、移動軌跡、基因序列等。序列型樣探勘(Sequence Pattern Mining)是序列資料分析的技術之一,其主要目的是從序列資料中挖掘出隱藏其中的特殊、重要且具代表性的特徵(feature),近來,此技術被廣泛的應用在生物資訊領域與時空軌跡資料的分析上,是資料探勘非常熱門的領域之一。 機率尾置樹(Probabilistic Suffix Tree, PST) 是一種Variable-Length Markov Chain (VMM)的實作,它被廣泛的應用在序列型樣探勘(Sequence Pattern Mining),一般認為,PST對序列資料的結構特徵或轉換行為,具有良好的擷取能力,非常適合用於預測和相似度比對。然而,隨著定位(Positioning)技術、感測(Sensing)技術的成熟,和無線(Wireless)技術演進,大量的序列資料被快速的累積,而傳統集中式的計算方法已經不堪負荷如此龐大資料(Big Data)的分析,因此,運算分散式運算雲端平台進行Big Data的計算是未來的趨勢。在本論文中,我們首先利用一個Na?ve方法來分析從Big Data建置機率尾置樹的困難點,為了解決Na?ve方法的缺陷,我們根據機率尾置樹的特性提出了一個新的CloudPST_OddEven MapReduce演算法,其包含一個MapReduce Chain,由四組MapReduce Tasks組成,分別負責型樣出現次數統計、尾置樹節點的組合、條件判斷等。 為了驗證CloudPST_OddEven演算法的效率,我們建置雲端Hadoop平台,並設計資料產生器產生合成的序列資料,並進行一系列的實驗;實驗結果顯示,CloudPST_OddEven演算法克服了Na?ve方法的瓶頸,展現不錯的效率。
其他識別: U0005-2811201416183816
文章公開時間: 10000-01-01
Appears in Collections:電機工程學系所



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.