Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/98379
Title: High Throughput Hardware Implementation for Deep Learning AI Accelerator
Author: Yen-Chi Lai
Keywords: High Throughput; Accelerator; Neural Network; Deep Learning; AI
Abstract:
This thesis proposes a high-throughput hardware accelerator for deep learning network computation. Because deep learning workloads demand a high volume of input data from external DRAM, we design an architecture with a high degree of data reuse to reduce the number of direct DRAM reads over the bus, and apply moderate pipelining to meet the high-throughput design requirement.

The proposed architecture uses 8-bit-precision arithmetic, a 128-bit bus, and 16 processing elements (PEs) operating in parallel. At a 125 MHz operating frequency, it achieves real-time operation with a throughput of 4 GOPS, which is sufficient for most deep learning neural networks.
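As a sanity check on the quoted figures, the 4 GOPS number follows from a standard accounting that the abstract does not spell out (assuming each PE completes one multiply-accumulate per cycle, with a MAC counted as two operations):

    16 PEs × 2 ops/MAC × 125 MHz = 4 × 10^9 ops/s = 4 GOPS

To make the data-reuse argument concrete, the following is a minimal back-of-the-envelope model in Python (an illustrative sketch only: the function, its parameters, and the reuse scheme are assumptions for illustration, not the dataflow actually implemented in this thesis). It compares external-DRAM traffic for a single convolutional layer when every multiply re-fetches its inputs versus when the input feature map is buffered on chip and reused across overlapping windows:

def dram_traffic_bytes(H, W, C_in, C_out, K, reuse=True, bytes_per_value=1):
    """Rough DRAM-traffic estimate for a KxK convolution (same padding).

    bytes_per_value=1 reflects the 8-bit precision used by the accelerator.
    Weights are counted as fetched once in both cases for simplicity.
    """
    ifmap = H * W * C_in                 # input feature-map values
    weights = K * K * C_in * C_out       # filter weights
    ofmap = H * W * C_out                # output feature-map values
    if reuse:
        # Each input value is fetched from DRAM once and reused on chip.
        reads = ifmap + weights
    else:
        # Each output value re-fetches its full KxK x C_in input window.
        reads = ofmap * K * K * C_in + weights
    return (reads + ofmap) * bytes_per_value   # reads plus output write-back

# Example: a 56x56x64 -> 56x56x64 layer with 3x3 kernels, 8-bit values.
naive = dram_traffic_bytes(56, 56, 64, 64, 3, reuse=False)
reused = dram_traffic_bytes(56, 56, 64, 64, 3, reuse=True)
print(f"no reuse: {naive / 1e6:.1f} MB, with reuse: {reused / 1e6:.1f} MB")

Under these assumptions the buffered version moves roughly two orders of magnitude less data (about 0.4 MB versus about 116 MB for this layer), which is the effect the data-reuse architecture targets.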
URI: http://hdl.handle.net/11455/98379
Rights: The author consents to authorize browsing/printing of the electronic full text; open to the public from 2021-08-30.
Appears in Collections: Department of Electrical Engineering

Files in This Item:
nchu-107-7105064300-1.pdf (1.06 MB, Adobe PDF; available only on the university internal network)