Automated labeling and information extraction of Chinese corpora are indispensable steps in digital collection and research: tagging the people, events, times, places, and objects mentioned in a text provides rich research material for humanities scholars. Manual labeling is time-consuming and laborious, so we aim to develop automated information annotation and retrieval systems. Through cooperation among the Institute of Information Science, the Institute of History and Philology (IHP), and the Institute of Linguistics, we will establish a Chinese-language labeling and information acquisition platform that not only provides a big-data environment for humanities and Chinese-language research, but also adds high-quality value to the data for scholars to conduct various types of analysis and research.
2020 is the fourth year of this five-year research project. We continue to cooperate with IHP's sub-project no. 5, “Project to Digitally Innovate Academic Settings,” and sub-project no. 4, “Scripta Sinica Database.” In 2017, the first year of the project, our goal was to identify the types of existing authority-term vocabulary. In 2018, to improve the efficiency of manual labeling in the object-labeling system of the Chinese electronic document database, we worked closely with our collaborator TTS (大鐸資訊) and added confidence scores from our proper-noun recognition technology to the existing labeling interface of the IHP. In 2019, the goal was to automatically recognize and produce new authority-file vocabulary from new Chinese texts, yielding a large number of high-quality authority files. In addition, we proposed a new research topic with Prof. Liu Cheng-Yun's team at IHP: building a knowledge graph from extracted (subject, relation, object) triples. With this technique, once proper nouns are automatically recognized in Chinese texts, the personal names, place names, official titles, and organization names among them can be automatically linked, making it possible to quickly determine, for example, who held which official position. In 2019, we automatically established the most critical resume data in the authority file using generative models.
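The triple-based linking described above can be sketched with a minimal triple store. This is an illustrative sketch only, not the project's actual implementation; the class, relation labels, and example triples below are all hypothetical.

```python
from collections import defaultdict

class TripleStore:
    """Stores (subject, predicate, object) triples and answers simple queries.

    A hypothetical stand-in for the project's knowledge graph: once proper
    nouns are recognized and linked as triples, queries such as "which offices
    did this person hold?" become direct lookups.
    """

    def __init__(self):
        self.by_subject = defaultdict(list)

    def add(self, subject, predicate, obj):
        self.by_subject[subject].append((predicate, obj))

    def query(self, subject, predicate):
        """Return every object linked to `subject` via `predicate`."""
        return [o for p, o in self.by_subject[subject] if p == predicate]

store = TripleStore()
# Hypothetical triples as they might be extracted from a historical text:
store.add("王安石", "held_office", "參知政事")
store.add("王安石", "held_office", "宰相")
store.add("王安石", "born_in", "臨川")

print(store.query("王安石", "held_office"))  # ['參知政事', '宰相']
```

In practice the relation labels would come from the authority files rather than being hard-coded, but the query pattern is the same.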
In 2020 (the fourth year), we intend to move from generative models to discriminative models, so that deep learning techniques can make full use of relational contexts and encode them to extract the relations. We also aim to expand the set of relation types. At the same time, we hope to explore career-path trajectories using the large amount of new resume data (expected to be automatically indexed into the authority file; it should contain few high official positions, since senior officials have already been manually indexed). Our proposed mechanism can provide another interesting perspective on such trajectories.
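The generative-to-discriminative shift can be illustrated schematically: rather than generating a resume entry, a discriminative extractor encodes the context between two recognized entity mentions and scores candidate relation labels. The toy cue-word weights below stand in for a trained neural encoder; all tokens, labels, and weights are hypothetical, not the project's model.

```python
def extract_context(tokens, e1_idx, e2_idx):
    """The tokens between the two entity mentions serve as the relation context."""
    lo, hi = sorted((e1_idx, e2_idx))
    return tokens[lo + 1:hi]

# Hypothetical "learned" weights: relation label -> cue token -> score.
# A real discriminative model would replace this table with an encoder
# and a classification layer trained on labeled relation instances.
WEIGHTS = {
    "held_office": {"拜": 2.0, "任": 1.5, "為": 0.5},
    "born_in":     {"生": 2.0, "於": 0.5},
    "no_relation": {},
}

def classify(tokens, e1_idx, e2_idx):
    """Score each relation label against the context and pick the best one."""
    context = extract_context(tokens, e1_idx, e2_idx)
    scores = {label: sum(cues.get(w, 0.0) for w in context)
              for label, cues in WEIGHTS.items()}
    return max(scores, key=scores.get)

# "王安石 拜 參知政事": the cue 拜 signals an appointment to office.
tokens = ["王安石", "拜", "參知政事"]
print(classify(tokens, 0, 2))  # held_office
```

The point of the sketch is the shape of the computation: the extractor discriminates among a fixed label set given an encoded context, which is what allows the relation inventory to be expanded by adding labels and training data rather than redesigning a generator.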