Please use this identifier to cite or link to this item: http://hdl.handle.net/11455/17772
Title: 表格辨識與表格資料擷取之研究
A study on form document recognition and data extraction
Author: 陳榮靜
Chen, Rung Ching
Keywords: form recognition
document analysis
text-only form
data extraction
character segmentation
dynamic programming
Publisher: Department of Applied Mathematics
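Abstract (translated from the Chinese): Form documents are common in communication between companies, between government agencies, and within organizations. Traditionally, the information on a form is read by a person and then keyed into a computer. Manual extraction of form information is costly, so automating it has become an important topic in office automation. Most forms contain form lines, but we also found that some Chinese forms have no form lines at all. We therefore propose one recognition method for forms with lines and another for forms without lines. Once a form has been recognized, we extract the required text blocks, and these text blocks must then be segmented into characters before being passed to a character recognition system. This dissertation proposes effective solutions for form document recognition, form data extraction, and the segmentation of handwritten Chinese character strings in office automation.

Most form documents contain lines. The dissertation proposes an effective method for recognizing such forms. The method is based on a form-line representation model, which is used to learn blank forms, recognize filled-in forms, and extract form data. The model represents a form with three types of form lines, all of which are normalized and sorted. Normalization and sorting not only solve the matching problem for enlarged or reduced forms but also provide an efficient way of matching. To make recognition more flexible, we adopt fuzzy matching. For skewed forms, only a subset of points needs to be rotated, which improves recognition efficiency.

For text-only forms, which have no lines, a recognition model cannot be built from line features. The shape of a text-only form changes with the text filled in, so recognizing such forms is not easy. For this kind of form, the dissertation proposes another effective method that uses text blocks and character counts. In our system, a blank text-only form is characterized by its text blocks and character counts and, after learning, is stored in a text-only form database. A filled-in text-only form then goes through two-stage matching to quickly find the corresponding form in the database. The matching problem for enlarged or reduced text-only forms is solved by matching the title text block. Fuzzy matching is again used to make recognition more flexible.

Once form recognition is complete, we extract the filled-in text blocks, which must be segmented into characters for character recognition. Chinese character segmentation is a preprocessing step for Chinese character recognition, and correct recognition depends on correct segmentation. In handwritten Chinese, characters may touch or overlap. Handwritten Chinese character segmentation is therefore a difficult problem, subject to great variability in writing habits. The dissertation proposes a new method for it: first, strokes are used to build stroke boxes; next, knowledge-guided merging of the stroke boxes is performed; finally, dynamic programming is used to find suitable segmentation points. We carried out a series of experiments on recognition of forms with lines, recognition of text-only forms, and handwritten Chinese character segmentation. The experiments confirm that the proposed approach is effective and feasible.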
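The idea of normalizing and sorting form lines and then matching them fuzzily can be pictured with a short sketch. This is an illustration under stated assumptions, not the dissertation's implementation: the segment representation as endpoint tuples, the bounding-box normalization, the triangular membership function, and the names `normalize`, `membership`, `similarity`, and `tol` are all hypothetical, and the three line types of the model are not reproduced here.

```python
from math import hypot


def normalize(segments):
    """Scale segment endpoints into the unit square of their bounding box.

    `segments` is a list of (x1, y1, x2, y2) tuples. The result is sorted so
    that two scans of the same form yield the same ordered representation.
    """
    if not segments:
        return []
    xs = [x for (x1, _, x2, _) in segments for x in (x1, x2)]
    ys = [y for (_, y1, _, y2) in segments for y in (y1, y2)]
    x0, y0 = min(xs), min(ys)
    width = (max(xs) - x0) or 1.0   # avoid division by zero for degenerate input
    height = (max(ys) - y0) or 1.0
    out = []
    for x1, y1, x2, y2 in segments:
        a = ((x1 - x0) / width, (y1 - y0) / height)
        b = ((x2 - x0) / width, (y2 - y0) / height)
        out.append(tuple(sorted([a, b])))  # canonical endpoint order
    return sorted(out)


def membership(distance, tol=0.05):
    """Assumed triangular fuzzy membership: 1 at distance 0, 0 at `tol` or more."""
    return max(0.0, 1.0 - distance / tol)


def similarity(blank, filled, tol=0.05):
    """Average, over blank-form segments, of the best fuzzy match in the filled form."""
    a, b = normalize(blank), normalize(filled)
    if not a or not b:
        return 0.0
    total = 0.0
    for p1, p2 in a:
        best = 0.0
        for q1, q2 in b:
            d = max(hypot(p1[0] - q1[0], p1[1] - q1[1]),
                    hypot(p2[0] - q2[0], p2[1] - q2[1]))
            best = max(best, membership(d, tol))
        total += best
    return total / len(a)
```

Because both forms are mapped into the same unit square before comparison, a magnified or shifted copy of a form scores the same as the original, which is the role the abstract assigns to normalization.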
Abstract: Form documents are widely used in daily work, especially when people visit companies or government departments on business. They are also used to transfer information between the departments of an organization. Traditionally, filled-in forms are processed by having people read the information and key it into a computer. This manual process is very costly, so processing filled-in form documents automatically is important in office automation. Most form documents contain line segments, but some forms have none. To handle both kinds, this dissertation proposes two methods, one for recognizing forms with line segments and one for forms without them.

The first method recognizes form documents that contain at least one line segment. It is based on an efficient representation model of the form, which uses three types of line segments to represent a form. All line segments are normalized and sorted after they are extracted. The normalization and sorting not only solve the problems of magnification and contraction but also provide a unified and efficient way of matching forms. To make the recognition method more robust, fuzzy matching is used. Methods for processing skewed forms are also proposed so that only some pixels need to undergo rotation transformations.

The second method recognizes form documents that have no line segments, namely text-only form documents. The shape of a text-only form changes when characters are filled into it, which makes such forms hard to recognize. Our method is based on the text blocks and the character boxes. The problem of recognizing magnified and contracted forms is solved by matching the title text block between the filled-in form and the blank form in the library. We also use fuzzy matching to make the recognition more robust.

After a form is recognized, the text blocks can be extracted. A text block must be segmented into characters before character recognition methods can be applied. Segmentation is an important preprocessing step in off-line Chinese character recognition because correct recognition of characters relies on correct segmentation. In this thesis, we also present a method that first uses strokes to build stroke bounding boxes, then merges those boxes with knowledge-based merging operations, and finally applies a dynamic programming method to find the best segmentation boundaries.

A series of experiments was conducted. The experiments show that our methods are very effective for form recognition and off-line handwritten Chinese character segmentation.
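The final segmentation step, choosing boundaries by dynamic programming, can be illustrated with a minimal sketch. The abstract does not give the cost function, so everything specific here is an assumption: the text line is taken to be horizontal, the merged stroke boxes are reduced to `(left, right)` extents sorted by their left edge, and the hypothetical cost penalizes the squared deviation of each group's width from a nominal `char_width`.

```python
def segment(boxes, char_width):
    """Group consecutive stroke boxes into characters by dynamic programming.

    `boxes` is a list of (left, right) extents sorted by left edge.
    Returns a list of (start, end) index ranges, end exclusive, one per character.
    The squared-width cost is an illustrative assumption, not the thesis's cost.
    """
    n = len(boxes)
    inf = float("inf")
    best = [0.0] + [inf] * n   # best[j]: minimal cost of segmenting boxes[:j]
    back = [0] * (n + 1)       # back[j]: start index of the last group in that optimum
    for j in range(1, n + 1):
        for i in range(j):
            left = boxes[i][0]
            right = max(r for _, r in boxes[i:j])
            cost = best[i] + ((right - left) - char_width) ** 2
            if cost < best[j]:
                best[j], back[j] = cost, i
    groups = []
    j = n
    while j > 0:               # walk the back-pointers to recover the boundaries
        groups.append((back[j], j))
        j = back[j]
    return groups[::-1]
```

With six boxes whose extents pair up into three character-sized groups, `segment` returns three index ranges. A knowledge-based cost, such as one that also considers gaps or overlaps between boxes, would replace the squared-width term without changing the recurrence.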
URI: http://hdl.handle.net/11455/17772
Appears in Collections: Department of Applied Mathematics

Files in This Item:

For the full text, please visit the Airiti Library (華藝線上圖書館).



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.