
Information Extraction for Ancient Chinese

Basic information
Project identifier AS-ASCDC-110-202
Conducted by Institute of Information Science
Director
Overview

Automated labeling and information extraction for Chinese corpora (for example, part-of-speech tagging and the tagging of people, events, times, places, and objects) is an indispensable step in digital collection and research, and it provides rich material for humanities researchers. Manual labeling is time-consuming and laborious, so we aim to develop automated annotation and information retrieval systems. Through cooperation among the Institute of Information Science, the Institute of History and Philology (IHP), and the Institute of Linguistics, we will establish a Chinese-language labeling and information acquisition platform that not only provides a big data environment for the humanities and Chinese-language research, but also adds high-quality value to that data so that scholars can conduct various types of analysis and research.

2021 is the fifth and final year of a five-year research project, which has cooperated since 2019 with IHP's "Project to Digitally Innovate Academic Settings" and "Scripta Sinica Database" subprojects. In 2017, the first year of the project, our goal was to identify categories in existing authority terms. In 2018, working closely with our official language collaborator TTS, we developed an API and tools to assign confidence scores to word segmentation markups. From 2019 to 2020, we also proposed a new research topic with Prof. Liu Cheng-yun's team from IHP: building knowledge graphs based on triplet relationships. When proper nouns are automatically recognized in Chinese texts, the relationships among the names of people, places, official positions, and organizations can then be linked automatically, for example by connecting each person to the official position that person held. In 2019, we began developing generative models to automatically produce authority files containing the most critical career data for the names of people and official positions in a text. In 2020, we further developed the triplet-based generative model to construct new career data pairing names of people with official positions, thereby identifying new authority terms.
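The triplet idea described above can be illustrated with a minimal sketch. It is not the project's actual model or data format: the `Triplet` type, the relation label `holds_office`, the helper names, and the placeholder entities ("Person A", "Office X") are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    """A (subject, relation, object) statement; field names are hypothetical."""
    subject: str   # e.g. a person's name
    relation: str  # e.g. "holds_office" (an assumed label)
    obj: str       # e.g. an official position


def build_graph(triplets):
    """Index triplets by subject and then by relation for simple lookups."""
    graph = defaultdict(lambda: defaultdict(set))
    for t in triplets:
        graph[t.subject][t.relation].add(t.obj)
    return graph


def offices_of(graph, person):
    """Return the sorted official positions linked to a person, if any."""
    return sorted(graph.get(person, {}).get("holds_office", set()))


# Placeholder triplets, not taken from any real authority file.
example = [
    Triplet("Person A", "holds_office", "Office X"),
    Triplet("Person A", "holds_office", "Office Y"),
    Triplet("Person A", "located_in", "Place P"),
]
graph = build_graph(example)
print(offices_of(graph, "Person A"))
```

Indexing by subject and relation keeps the "who held which position" lookup trivial; a real system would likely also need name disambiguation and provenance for each triplet.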

In 2021, we intend to incorporate more corpus data to deduce more triplets, discover new names and official positions, and assist IHP in cleaning metadata and establishing official career chronologies. Using IHP's feedback, we will continuously improve the algorithm and help debug and clean existing authority files. In addition, our technology makes it possible to automatically analyze what someone did after becoming an official, where that person went, or even what that person did after becoming an official at a particular place. By enriching the details in such descriptions and characterizations of historical figures, we can provide humanities scholars with a large amount of material for various analyses.
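As a rough illustration of the kind of question described above ("what did someone do after becoming an official at a certain place?"), the sketch below orders a person's records by year and returns everything after the first matching appointment. The `Event` record, its fields, the `kind` labels, and the placeholder data are hypothetical and do not reflect the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One dated record about a person; all fields are assumed for illustration."""
    person: str
    year: int
    kind: str                    # "appointment", "travel", or "action" (assumed labels)
    detail: str
    place: Optional[str] = None


def chronology(events, person):
    """Return one person's events sorted by year."""
    return sorted((e for e in events if e.person == person), key=lambda e: e.year)


def events_after_appointment(events, person, place=None):
    """Return events that follow the person's first appointment.

    If `place` is given, only an appointment at that place counts.
    """
    timeline = chronology(events, person)
    for i, e in enumerate(timeline):
        if e.kind == "appointment" and (place is None or e.place == place):
            return timeline[i + 1:]
    return []


# Placeholder records, not drawn from any historical source.
records = [
    Event("Person A", 3, "action", "did something", "Place P"),
    Event("Person A", 1, "appointment", "Office X", "Place P"),
    Event("Person A", 2, "travel", "went somewhere", "Place Q"),
]
for e in events_after_appointment(records, "Person A", place="Place P"):
    print(e.year, e.kind, e.detail)
```

Sorting by year is the simplest possible ordering; historical sources often give only reign periods or relative dates, so a real chronology would need a richer date model.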

In the project's final year, we also hope to continue integrating all of the technologies we have developed into ASCDC's Digital Humanities Research Platform. Aside from ongoing improvements to the previously integrated tools for automated ontology establishment, word segmentation, and part-of-speech tagging, we hope to conclude the project in 2021 by integrating the automated relationship extraction tool into the platform.

Find out more Lexical Semantics with Ontology Academia Sinica Tagged Corpus of
