Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/89055
Title: A Study on Multi-layered Automatic Book Classification System Using Data Mining
Original title: 使用資料探勘探討多層式圖書自動分類系統之研究
Author: Huei-Chen Wu (吳慧貞)
Keywords: multi-layered automatic book classification system; voting strategy; classifier; data mining
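Abstract:
Classifying and cataloging library materials is the core of library management at every level and its most important foundational work. In routine cataloging, a librarian decides which class an item belongs to from its meaning and main subject. Most domestic librarians, however, have a library and information science background yet must catalog every book that arrives, so insufficient subject knowledge often makes classification difficult. In addition, every discipline has advanced considerably in recent years and the number of published books has grown sharply, so the burden on cataloging librarians becomes heavier by the day. This delays the shelving of new acquisitions and, because of differences in subjective judgment, readily leads to cataloging quality problems such as low inter-consistency and intra-consistency.
This study examines the approach of traditional single-layer book classification systems and, combining the strengths of several classifiers, proposes a multi-layered automatic book classification system that uses a voting strategy. To evaluate the multi-layered system, two corpora (theses and dissertations, and online bookstore bibliographic records), together with their mappings to library classification numbers, serve as training and test data. For the theses and dissertations, the study examines how different combinations of document content affect feature extraction and identifies the best content combination for automatic book classification. It also examines combinations of classifiers to find the best combination for the multi-layered book classifier. Finally, the experimental results confirm that the multi-layered book classification system reaches an accuracy of 99% and classifies better than the traditional single-layer system.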

Cataloging books is the core and foundation of library management at all levels. Most librarians are trained only in library and information science, yet they are responsible for cataloging books in every field of knowledge. Without the relevant background knowledge, cataloging becomes increasingly difficult for them. Moreover, as rapid progress in every field drives a rapid increase in the number of publications, the cataloging workload grows further. As a result, cataloging quality, such as high inter-consistency and high intra-consistency of library classification, cannot be maintained.
This thesis therefore examines the issues of traditional one-layered book classification systems and draws on the advantages of various classifiers to propose a two-layered book classification system using a voting strategy. The collection of dissertations from National Chung Hsing University and the bibliographic records of an online bookstore are used as the training and test corpora, and the classification code of each dissertation serves as the gold standard. Each dissertation contains various content parts, such as the title, authors, and cited papers. To understand how combinations of content parts affect classification, various combinations are studied and the best one is recommended. Likewise, to obtain the best classification performance, combinations of classifiers for the multi-layered book classification system are studied and the best combination is recommended. Finally, the experimental results show that the proposed multi-layered book classification system outperforms traditional one-layered book classification systems.
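As a rough illustration of the voting idea mentioned in the abstract, the following Python sketch combines several classifiers by strict majority vote and falls back to a second layer when the first layer does not agree. The abstract does not describe how the layers are actually wired together, so the fallback rule, the function names (`majority_vote`, `classify_two_layers`, `by_keyword`), the toy keyword classifiers, and the class labels are all assumptions made for illustration, not the thesis's design.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A classifier is modeled as any function from text to a class label.
Classifier = Callable[[str], str]


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the label held by a strict majority of voters, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count * 2 > len(labels) else None


def classify_two_layers(
    text: str,
    first_layer: Sequence[Classifier],
    second_layer: Classifier,
) -> str:
    """Hypothetical two-layer scheme: vote first, defer on disagreement."""
    votes = [clf(text) for clf in first_layer]
    winner = majority_vote(votes)
    if winner is not None:
        return winner
    # Assumption: with no strict majority, a second-layer classifier decides.
    return second_layer(text)


def by_keyword(keyword: str, hit: str, miss: str) -> Classifier:
    """Toy stand-in for a trained classifier: label by keyword presence."""
    return lambda text: hit if keyword in text.lower() else miss


if __name__ == "__main__":
    first = [
        by_keyword("mining", "computing", "library science"),
        by_keyword("data", "computing", "library science"),
        by_keyword("catalog", "library science", "computing"),
    ]
    fallback = by_keyword("book", "library science", "computing")
    print(classify_two_layers("Data mining basics", first, fallback))
```

In practice the toy keyword functions would be replaced by trained classifiers, and the strict-majority threshold is only one of several plausible voting rules.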
URI: http://hdl.handle.net/11455/89055
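Other identifiers: U0005-2008201514072000
Rights: Authorized for browsing and printing through the electronic full-text service; publicly available from 2018-08-25.
Appears in Collections: Graduate Institute of Library and Information Science (圖書資訊學研究所)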

Files in This Item:
File: nchu-104-7097014013-1.pdf
Size: 3.25 MB
Format: Adobe PDF
Availability: This file is only available in the university internal network