A dataset of geographic entities and relationships from Song Dynasty texts on Lin'an
Sci Data. 2026 May 30. doi: 10.1038/s41597-026-07527-2. Online ahead of print.
ABSTRACT
Automatically extracting geographical entities and spatial relationships from historical texts is a fundamental task in named entity recognition (NER) and relation extraction (RE), with important implications for historical geography and the digital humanities. Classical Chinese documents describing ancient cities pose particular challenges because of their archaic language, implicit spatial expressions, and complex entity hierarchies. In this study, we present a manually annotated dataset designed for joint extraction of geographical entities and spatial relationships from texts related to Lin'an, the capital of the Southern Song Dynasty. The dataset consists of 18 in-domain and 1 out-of-distribution historical documents comprising approximately one million Chinese characters, annotated with 24 categories of geographical entities and 34 types of spatial relationships. It provides a resource for advancing NER and spatial relation extraction in historical texts and supports future research in historical geographic information systems (GIS), cultural geography, and digital heritage reconstruction.
PMID:42225712 | DOI:10.1038/s41597-026-07527-2