Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/51588
標題: A Combinatorial-Approximation-Based Load Shedder for Data-Stream Frequent-Pattern Mining Systems
運用組合估算理論為資料串流頻繁樣式探勘系統所設計之負載控制機制
作者: 賈坤芳
關鍵字: 資訊科學--軟體;資料探勘;Data Mining;資料串流;資料串流探勘;頻繁樣式;組合估算;負載控制;應用研究;Data Stream;Data-stream Mining;Frequent Pattern;Combinatorial Approximation;Load Shedding
摘要: 
Data mining is a process of finding interesting knowledge from a world of (raw) data stored in databases.Recently, knowledge discovery communities have focused on a new model of data processing, where dataarrive in the form of continuous streams. It is often referred to as data streams or streaming data. A datastream is a massive and usually unbounded sequence of data elements which are continuously transmitted ata rapid rate and in a random order. Data streams have wide applications in real world in recent years, such astransactional records of a chain-store head office from every retail sales, web-flow or click-stream records ofa website monitoring server, etc. There is possibly some hidden information in these streaming data whichare valuable but not easy to find out. The natural feature of variable transit-rate of data streams has broughtout many constraints; moreover, streams have an uncertain factor that their peak volume may burst to severaltimes during a spike. As a result, data mining in data streams is much more difficult than that in the(traditional) databases. For processing data streams, there is a special technique called “load shedding” toaddress the overloading problem resulting from streams' variable high-speed. However, there are only fewresearches on load shedding for mining in data streams so far.In this proposal, we propose a load-shedding mechanism for mining frequent patterns in data streams.Different from other existing mechanisms, the proposed mechanism raises the throughput of a miningalgorithm by increasing its processing rate on each stream element. The proposal is the first one to study theload-shedding problems in data-stream mining of frequent patterns. We shall design a novel and general loadshedder based on the theory of combinatorial approximation. The load shedder suits for a mining methodbelonging to the types of ε-deficient mining or sliding-window exact mining. As the load shedder is working,the mining method needs to record fewer itemsets for each transaction than it does before, and the overload isthen solved with the increasing rate of data processing. For those itemsets that have not been kept or beenignored, their support-values can be approximated through combinatorial approximation when required. Theresearch in this proposal shows some important issues about applying combinatorial approximation as ameans of load shedding. Besides, the result of this research can be applied to many existing miningalgorithms, improving the scalability and practicality of data-stream mining systems. For those applicationshaving a high-degree demand for data-stream mining results, our load-shedding mechanism will bring themessential helpfulness.

「資料探勘」是一門從儲存於資料庫的龐大資料中挖掘出有趣知識的技術。近年來在許多實際應用中,資料不再是靜態儲存於資料庫中的型態,而是以串流的方式陸陸續續抵達本地端的系統,稱為「資料串流」的處理模型。資料串流是一個由(近乎)無限多的資料元件所形成之長串序列,以不定順序、非常快的速度產生,並且持續不斷地傳輸。資料串流在現實生活中的應用非常廣泛,因為在串流資料當中可能潛藏著具有價值的資訊,然而要發現它們並不容易。由於資料串流具有傳輸速率隨時間變動、尖峰時刻資料量暴增等不安定因素,使得資料串流探勘比起(傳統)資料庫探勘要來得困難許多。欲解決這個現實問題,需要使用「負載控制」這項專門技術,然而目前鮮有對於資料串流探勘的負載控制研究成果。針對傳輸速率不固定的資料串流環境,本計畫提出一套適用於探勘頻繁項目集的負載控制機制。這套機制的想法有別於既有做法,藉由加快探勘方法處理串流資料元件的速率,提升單位時間吞吐量,達到減低負載量的效果。本計畫首先針對資料串流探勘領域的負載控制技術進行研究,設計創新且通用性高的負載控制方法;這套方法以「組合估算」理論為基礎,可以與ε-deficient mining 或sliding-window exact mining 類型的不特定探勘演算法結合,為其提供負載控制功能。當負載控制機制啟動時,探勘方法對於每一筆串流交易將只需要記錄較原本少量的項目集,以提升的資料處理速率來降低過載的資料量;而未記錄或被略過的項目集可以在有需要時透過組合估算方法迅速求得。在學術上,本計畫針對組合估算技術應用於負載控制,提出並探討相關研究議題,展示新研究方向。在產業上,對於對資料串流探勘有高度需求的應用,本計畫的研究成果可適用於多數既有探勘演算法,提升資料串流探勘系統的延展性與實用性,為這些應用帶來實質幫助。
URI: http://hdl.handle.net/11455/51588
其他識別: NSC99-2221-E005-087
Appears in Collections:資訊科學與工程學系所

Show full item record
 

Google ScholarTM

Check


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.