CNN-OCR Can Correctly Recognize 90% of Characters of Chinese Rare Books. ASCDC Received the TANET 2017 Honorable Mention Award!

Posted on: 2017/12/22
Posted by: Academia Sinica Center for Digital Cultures

 

 

"The 23rd Taiwan Academic Network Conference" (TANET 2017) was held October 25-27, 2017 at Tunghai University, Taiwan. ASCDC presented the paper "Exploring Factors Affecting the Degree of Optical Character Recognition’s (OCR) Accuracy", which won the Honorable Mention Award at the conference.

 

Established in 1995, TANET is one of the largest and most influential academic conferences on information and networking. This year's theme was "Artificial Intelligence, Big Data, and Collaboration (ABC) in Next Generation Networks", exploring how artificial intelligence and machine learning use massive data to create tremendous value in areas such as education, the internet, finance, and medical care. The conference also examined issues such as cloud computing, TANet 100G fiber-optic network applications and services, information security, the digital divide, and the Internet of Things. Approximately 350 papers were presented.

 

"Exploring Factors Affecting the Degree of Optical Character Recognition’s (OCR) Accuracy" stood out among many excellent papers and received the Honorable Mention Award at the conference. The paper was co-written by Mr. Hsiang-An Wang, chief engineer of the ASCDC, and Mr. Kung-Yu Su and Mr. Yu-Hsien Wu, interns from the Department of Computer Science and Engineering, Yuan Ze University, Taiwan. The objective of the paper is to develop Optical Character Recognition (OCR) software by using Convolutional Neural Network (CNN) technology to train a neural network model that can recognize Chinese characters in Chinese rare books. Such software can automatically recognize the Chinese characters in an image and improve the process of digitizing ancient texts.

 

CNN is a new type of deep learning technology that has been widely adopted in image recognition, video content analysis, natural language processing, drug discovery, and even the artificial intelligence (AI) program "AlphaGo". This is the first time the ASCDC has used CNN technology to train Chinese character recognition software on images of characters from Chinese rare books. According to preliminary results, OCR accuracy can reach around 90%, and the software shows the best recognition rate compared to commercial character recognition software.

 

For experimental purposes, ASCDC focused mainly on the important Chinese medical books "Ben Tsau Shu Gou Yuan" and "Jing Yue Quan Shu" from the Scripta Sinica database, which was built by the Institute of History and Philology, Academia Sinica. The CNN-OCR was highly praised by the conference reviewers, who look forward to further OCR training that will reduce the cost of metadata creation for institutions and facilitate the development of digital humanities.

 

TANET 2017 featured 54 sessions, 5 keynote speeches, and 7 lectures on new technology, and brought together more than 700 academic researchers and industry experts. The keynote speakers included Mr. San-Cheng Chang, former premier of the Executive Yuan of Taiwan, and Mr. Wei-Hsin Sun, director of the National Museum of Natural Science in Taiwan.

 

You may download the file: "Exploring Factors Affecting the Degree of Optical Character Recognition's (OCR) Accuracy"

 

 

 

 
