Effective tagging and data retrieval are indispensable to digital archiving and academic research. Identifying the semantic roles of people, events, times, places and objects can empower humanities research by making large repositories of terms easier to navigate. Manual tagging, however, is time-consuming and painstaking. The objective of this project is to develop an automated tagging and retrieval system. Within the five-year time frame, the goal for 2018 is to construct an ontology for the terminology of Chinese texts and a recognizer that identifies terminology without requiring a training corpus. In the meantime, we are transferring a text segmentation and terminology recognition program to the IT team of the Academia Sinica Center for Digital Cultures (ASCDC), so that a platform dedicated to processing the language of Chinese texts can be built. The platform will be made available to colleagues in every department of ASCDC in 2018.
Existing manually tagged corpora will provide important guidance for the automatic tagging of terminology, including names of people, places, government positions and medicinal drugs. Because no manually tagged terminology is available for medicinal drugs, we intend to extract a term list from the canonical Compendium of Materia Medica and to obtain contextual information for the terms from 127 ancient medical books provided by the Institute of History and Philology. The recognition model will be based on Conditional Random Fields (CRF) and will follow the Statistical Principle-based Approach (PBA) to shape its recognition patterns. In the first year, viz. 2015, we trained the model on terms for people, places, government positions and medicinal drugs in the New Book of Tang (Xin Tangshu), Old Book of Tang (Jiu Tangshu), Historical Records of the Five Dynasties (Wudai Shiji) and Old History of the Five Dynasties (Jiu Wudai Shi). As a result, the model reaches an F-score of approximately 90%.
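To make the F-score figure concrete, the following is a minimal sketch of how an entity-level F-score is commonly computed for a sequence tagger of this kind. It assumes character-level BIO tags (B- opens a term, I- continues it, O is outside any term) and exact-match scoring. The tag names PER and LOC and the toy sequences are illustrative assumptions, not the project's actual tag set or evaluation data.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) term spans in a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def f_score(gold_tags, predicted_tags):
    """Exact-match precision, recall and F1 over term spans."""
    gold = extract_spans(gold_tags)
    predicted = extract_spans(predicted_tags)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (hypothetical tags): the second term is predicted too short,
# so only one of the two gold terms counts as correct.
gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
print(f_score(gold, pred))  # (0.5, 0.5, 0.5)
```

Under exact-match scoring a term with a wrong boundary counts as both a missed term and a spurious one, which is why boundary errors lower precision and recall together.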
In 2018 we intend to build an ontology that unifies the terminology of the Chinese corpus by giving every term an entry with well-defined semantic categories and collocational relations. Through internal comparison within the ontology and by learning term vectors, we plan to add tagging of diseases, symptoms and etiology. Because no manually tagged vocabulary exists for these categories, we must develop a recognition technique that does not depend on annotated corpus knowledge. The challenge lies in the newly added vocabulary types and the lack of vocabulary for training. We consider distant supervision one possible solution.
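As an illustration of the idea, one common form of distant supervision projects an existing term list onto untagged text to produce noisy training labels, which a sequence model can then learn from. The sketch below assumes forward longest matching and character-level BIO labels. The function name, the category names DRUG and DISEASE, the toy lexicon and the sample string are all hypothetical and are not taken from the project's data.

```python
def distant_label(text, lexicon):
    """Tag each character with a BIO label by forward longest match.

    lexicon maps a known term to its category; unmatched characters get "O".
    """
    tags = ["O"] * len(text)
    max_len = max((len(term) for term in lexicon), default=0)
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate first so longer terms win over substrings.
        for n in range(min(max_len, len(text) - i), 0, -1):
            category = lexicon.get(text[i:i + n])
            if category is not None:
                tags[i] = "B-" + category
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + category
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical toy lexicon and sentence, for illustration only.
toy_lexicon = {"人參": "DRUG", "甘草": "DRUG", "傷寒": "DISEASE"}
toy_text = "傷寒用人參甘草"
labels = distant_label(toy_text, toy_lexicon)
for char, label in zip(toy_text, labels):
    print(char, label)
```

Labels produced this way are noisy: terms missing from the lexicon are left untagged, and ambiguous strings are tagged wherever they occur, so in practice they would serve as a starting point for training rather than as a gold standard.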
In the third year, we would like to concentrate on tagging terms for medicinal drugs, diseases, symptoms and etiology, and on capturing the interconnections among these terms. With the aid of manually tagged corpora, we will be able to identify the semantic relations among the terms.
Since the recognition program should be fairly mature by the fourth year of the project, we look forward to expanding the types of terminology by means of bootstrapping. At the same time, big data analysis of the Chinese corpus will be ready to be applied to investigating prescription habits in different dynasties.
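Bootstrapping in this sense usually means growing a term list iteratively: contexts that surround known terms are collected, and new strings found in the same contexts are added as candidate terms. The following is a minimal sketch under strong simplifying assumptions (fixed-length terms, single-character left and right contexts, a frequency threshold for trusting a context). The function, its parameters and the toy sentences are hypothetical and do not describe the project's actual procedure.

```python
from collections import Counter


def bootstrap(sentences, seeds, term_len=2, rounds=2, min_pattern_count=2):
    """Expand a seed term set using (left char, right char) context patterns."""
    terms = set(seeds)
    for _ in range(rounds):
        # Step 1: count the contexts in which known terms occur.
        patterns = Counter()
        for s in sentences:
            for i in range(1, len(s) - term_len):
                if s[i:i + term_len] in terms:
                    patterns[(s[i - 1], s[i + term_len])] += 1
        trusted = {p for p, count in patterns.items() if count >= min_pattern_count}

        # Step 2: harvest unknown strings that occur in a trusted context.
        new_terms = set()
        for s in sentences:
            for i in range(1, len(s) - term_len):
                candidate = s[i:i + term_len]
                context = (s[i - 1], s[i + term_len])
                if context in trusted and candidate not in terms:
                    new_terms.add(candidate)

        if not new_terms:
            break  # nothing new was found, so further rounds would not help
        terms |= new_terms
    return terms


# Hypothetical toy sentences, for illustration only.
sentences = ["加人參三兩", "加甘草三兩", "加黃芩三兩"]
seeds = {"人參", "甘草"}
found = bootstrap(sentences, seeds)
print(sorted(found))
```

Here the two seed terms share the context (加, 三), so the string 黃芩 that appears in the same context is added to the term set. Real bootstrapping needs safeguards against semantic drift, such as scoring patterns and candidates before accepting them.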
The fifth year, we hope, will see tighter interconnection between sets of terminology, and the time will be ripe for turning the program into a big data analysis platform available to every scholar and researcher.
As the Institute of Information Science, the Institute of History and Philology and the Institute of Linguistics are collaborating on this project, we hope that our information platform for Chinese texts will not only create a big data channel for humanities scholars, but also open up possibilities for value-added applications that facilitate various analyses.
In Chinese medicine, for instance, once all terms for medicinal drugs, diseases, symptoms and etiology, together with their interrelations, are located in a single unified network, scholars will be able to undertake a diachronic examination of the use of medicinal drugs with accurate statistics and cross-references. They may go on to assess whether dosage and style of prescription vary systematically by time, region, author and physician. Such an analytic tool may also offer clues for advancing modern Chinese medicine.
In the short run, we are building an online interface that integrates the resources of the Institute of History and Philology and the Institute of Linguistics, so that users can find corpora that the two institutes have already tagged. This will eventually lead to a comprehensive database of Chinese texts that complies with international standards and is therefore available to researchers around the globe, who can in turn offer valuable feedback and help improve the system.