Automated labeling and information extraction from Chinese corpora are indispensable steps in digital collection and research: tagging people, events, times, places, and objects, and assigning part-of-speech tags to words, yields rich research material for humanities scholars. Because manual labeling is time-consuming and laborious, we aim to develop automated information annotation and retrieval systems. Through the cooperation of the Institute of Information Science, the Institute of History and Philology (IHP), and the Institute of Linguistics, we will establish a Chinese-language labeling and information-acquisition platform that not only provides a big-data environment for the humanities and Chinese studies, but also adds high-quality value to that data so scholars can conduct various types of analysis and research.
This is a five-year research project, and 2019 is its third year. We continue the cooperation begun in 2018 with IHP sub-project No. 5, "學術創新數位深耕計畫," and sub-project No. 4, "漢籍全文資料庫," to improve the efficiency of manual labeling in the object-labeling system of the Chinese electronic document database, working closely with the official language collaborator, TTS, and adding the confidence scores produced by our proper-noun identification technology to the IHP's existing labeling interface. Another important goal in 2019 is to learn automatically, from Chinese texts, a large number of high-quality authority files. (In 2017, the first year of the project, our goal was to identify the types of existing authority vocabulary; in 2019 we expect to go further and automatically produce new, previously unrecognized authority vocabulary, i.e., new authority-file entries.)
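To make the authority-file idea concrete, the following is a minimal, illustrative sketch (not the project's actual system) of recognizing known authority vocabulary in a Chinese text by greedy longest match against a dictionary; the dictionary contents and type labels here are invented for the example.

```python
# Sketch: recognize existing authority-file terms in raw Chinese text by
# greedy longest match. `authority` maps a term to its entity type.

def match_authority_terms(text, authority):
    """Scan `text` left to right; at each position, take the longest
    substring found in `authority`. Returns (term, type, offset) hits."""
    max_len = max(map(len, authority), default=0)
    hits, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i : i + length]
            if cand in authority:
                hits.append((cand, authority[cand], i))
                i += length
                break
        else:
            i += 1  # no match starting here; advance one character
    return hits

# Toy authority dictionary (hypothetical entries for illustration).
authority = {"司馬光": "person", "資治通鑑": "work", "洛陽": "place"}
print(match_authority_terms("司馬光居洛陽撰資治通鑑", authority))
# → [('司馬光', 'person', 0), ('洛陽', 'place', 4), ('資治通鑑', 'work', 7)]
```

A dictionary match like this only finds vocabulary already in the authority file; the 2019 goal of discovering *new* authority vocabulary requires statistical or learned recognizers on top of it.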
In addition to identifying authority vocabulary, in 2019 we also propose a new research topic with the team of Prof. Liu, Cheng-Yun, of the IHP: building a knowledge graph from subject–predicate–object triples. Once proper nouns are automatically recognized in Chinese texts, the relationships among person names, place names, official titles, and organization names can be automatically linked, for example, establishing who held which official position. In 2019 we will first research and develop the pairing of person names with official titles, that is, the automatic construction of the most critical résumé data in the authority file. At the same time, we will explore what becomes possible once a large number of new résumés are available (predictably for lower-ranking officials, since senior officials have usually already been included manually), in the hope of enabling new and creative humanities research topics, such as tracing career trajectories, that differ markedly from what is possible today. In 2019 we will also continue to improve the knowledge ontology, word segmentation, and part-of-speech tagging. The problems encountered in the past two years are analyzed and countermeasures proposed; for details, please refer to the following chapters.
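The pairing of person names with official titles can be sketched as a simple relation-extraction step over NER output. The following is a hedged illustration under stated assumptions: the entity labels `PERSON` and `OFFICE`, the relation name `holds_office`, and the proximity heuristic are all inventions for this example, not the project's actual method.

```python
# Sketch: turn NER output into (person, holds_office, title) triples by
# pairing each PERSON with the nearest following OFFICE entity.

def extract_office_triples(entities, max_gap=2):
    """`entities` is a list of (text, label) pairs in document order, as a
    hypothetical upstream NER tagger might produce. For each PERSON, look
    ahead at most `max_gap` entities for an OFFICE and emit one triple."""
    triples = []
    for i, (text, label) in enumerate(entities):
        if label != "PERSON":
            continue
        for text2, label2 in entities[i + 1 : i + 1 + max_gap]:
            if label2 == "OFFICE":
                triples.append((text, "holds_office", text2))
                break  # pair each person with at most one office
    return triples

# Toy example: "韓愈任吏部侍郎" as tagged by a hypothetical NER step.
entities = [("韓愈", "PERSON"), ("吏部侍郎", "OFFICE")]
print(extract_office_triples(entities))
# → [('韓愈', 'holds_office', '吏部侍郎')]
```

Triples of this shape are exactly what a knowledge graph stores, so accumulating them over a corpus yields the résumé records described above, one edge per person–office pairing.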