Reading view

Visual Exploration of a Historical Vietnamese Corpus of Captioned Drawings: A Case Study

IEEE Comput Graph Appl. 2026 Feb 2;PP. doi: 10.1109/MCG.2026.3660122. Online ahead of print.

ABSTRACT

This paper presents a case study focusing on the exploratory visual analysis of a unique historical dataset consisting of approximately 4000 visual sketches and associated captions from an encyclopedic book published in 1909-1910. The book, which offers insight into Vietnamese crafts and social practices, poses the challenge of extracting cultural meaning and narrative structure from thousands of drawings and multilingual captions. Our research aims to explore and evaluate the effectiveness of multiple visualization techniques in uncovering meaningful relationships within the dataset while working closely with professional historians. The main contributions of this study include refining historical research questions through task and data abstraction, combining and validating visualization techniques for historical data interpretation, and involving a focus group of historians for further evaluation. These contributions offer generalizable insights for the development of domain-specific visualization tools and support interdisciplinary engagement in historical data visualization and critical digital humanities research.

PMID:41628052 | DOI:10.1109/MCG.2026.3660122

  •  

A georeferenced dataset of archaeobotanical findings of Olea europaea and Vitis vinifera compiled from published records from Central Italy

Data Brief. 2026 Jan 7;64:112443. doi: 10.1016/j.dib.2025.112443. eCollection 2026 Feb.

ABSTRACT

Here we present a coherent, georeferenced and chronologically qualified corpus of fossil plant remains compiled from published archaeobotanical records from archaeological sites from Central Italy, focused on Olea europaea (olive) and Vitis vinifera (grape). The dataset is entirely based on secondary data and does not include newly generated primary archaeobotanical analyses. The dataset integrates site, context and all relevant archaeobotanical occurrences within a coherent relational and spatial model. The corpus was initiated through a structured bibliographic survey aided by the BRAIN database. Exclusively published literature was consulted, allowing us to model archaeological sites and link them to excavation contexts and individual archaeobotanical occurrences (defined as the combination of a taxon and the specific plant part recovered, e.g., fruit, seed, rachis). The geodatabase was implemented using QGIS, with a local backend in GeoPackage, then migrated to PostgreSQL/PostGIS to support complex spatial/relational queries and future online outputs. All entities have a defined spatial placement accompanied by explicit quality-control parameters documenting positional uncertainty, source type and authority, as derived from the original published sources, ensuring transparent assessment of locational reliability. To enrich taxonomic information, an automated open thesaurus was built from CC BY/CC BY-SA resources (Floritaly, Acta Plantarum, and Wikimedia projects). The workflow employs REST-style access (or form-equivalent submissions), conservative rate-limiting, randomized waits, retries, and checkpoints; provenance and attribution (including noted transformations) are preserved. A standardized chronological table harmonizes relative cultural phases using ICCD nomenclature, with controlled fallbacks to Perio.do or peer-reviewed literature; a self-referential hierarchy (parent_id) ensures inheritance from sub-phase to broader period. Crucially, the use of open licenses, stable identifiers and cross-references makes the dataset interoperable and interlinked with the source ecosystems from which the secondary archaeobotanical data were extracted: records can resolve back to Floritaly and Acta Plantarum, and our forthcoming web portal can expose these connections for bidirectional navigation, automated updating and external reuse. The result is an interoperable, verifiable resource suitable for spatial and temporal analyses of plant remains based on aggregated and standardized published archaeobotanical data, while remaining legally reusable under the original licenses.

PMID:41624435 | PMC:PMC12855569 | DOI:10.1016/j.dib.2025.112443
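
The polite-harvesting workflow described in the abstract above (REST-style access, conservative rate-limiting, randomized waits, retries, and checkpoints) can be approximated with a short script. This is a minimal sketch only: the endpoint URL, query parameter, and checkpoint file name are hypothetical placeholders, not the authors' actual pipeline.

```python
import json
import random
import time
from pathlib import Path

import requests

CHECKPOINT = Path("harvest_checkpoint.json")  # hypothetical checkpoint file
BASE_URL = "https://example.org/api/taxa"     # placeholder, not the real thesaurus endpoint


def load_checkpoint():
    """Resume from the last successfully harvested taxon, if any."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"done": []}


def save_checkpoint(state):
    CHECKPOINT.write_text(json.dumps(state))


def fetch(taxon, retries=3):
    """Fetch one record with retries and a conservative back-off."""
    for attempt in range(retries):
        try:
            resp = requests.get(BASE_URL, params={"name": taxon}, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            time.sleep(2 ** attempt)  # simple exponential back-off between retries
    return None


state = load_checkpoint()
for taxon in ["Olea europaea", "Vitis vinifera"]:
    if taxon in state["done"]:
        continue  # already harvested in a previous run
    record = fetch(taxon)
    if record is not None:
        state["done"].append(taxon)
        save_checkpoint(state)
    time.sleep(random.uniform(2.0, 5.0))  # randomized politeness delay between requests
```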

  •  

Refrigerator Wisdom: Social Rules and Rights as a Conceptual Framework for Digital Ethics

Is it possible that examining the rules which govern our use of an everyday appliance can assist data practitioners in collecting and using data in ethical and responsible ways? This inquiry is based on a theory that current and emerging frameworks for data ethics share a palpable synergy with social rules that govern our use of everyday technologies, like the household refrigerator. [...]

  •  

Data Supremacy: Race In-Formation Through Herman Hollerith’s Tabulating Machine

In this essay, I examine how data was racialized in the United States through the Tabulating Machine, developed by the German American inventor Herman Hollerith in the 1880s to automate census tabulation. During that time, scientific theories of race posited racial categories to be biologically distinct and hierarchized, thereby defining “whiteness” as a favoured trait. As such, the [...]

  •  

Preface to the Proceedings of RAIL 2025

The sixth workshop on Resources for African Indigenous Languages (RAIL) was held on 10 November 2025 at the CSIR International Convention Centre in Pretoria, South Africa. It was co-located with the Digital Humanities Association of Southern Africa (DHASA) 2025 conference, which took place from 11 to 14 November 2025.

  •  

Siswati Part of Speech Tagger: A Quantitative Evaluation

This article evaluates the performance of the Siswati Text Annotation Tool part of speech (STAT POS) tagger using Recall, Precision and F1 score metrics. A quantitative research design was adopted for analysis, and data was collected through purposive sampling. Python 3 was utilised to calculate the Recall and Precision of the STAT POS tagger outputs. The results show that the Recall for nouns was 0.761, Precision 0.417, with an F1 score of 0.54; for verbs, the Recall was 0.756, Precision 0.798 and F1 score 0.54; for adverbs, the Recall was 0.571, Precision 0.8, and F1 score 0.67; for possessives, the Recall was 0.963, Precision 0.813 and F1 score 0.88. For relatives (REL), the Recall was 0.706, Precision 0.523, and the F1 score 0.60; for class-indicating demonstratives, the Recall was 0.333, Precision 0.25, and the F1 score 0.29; and for copulatives (COP), the Recall was 0.75, Precision 0.75, and the F1 score 0.75. For conjunctions, the Recall was 0.85, the Precision was 0.68, and the F1 score was 0.76; for pronouns, the Recall was 0.563, the Precision was 1.0, and the F1 score was 0.72; for adjectives, the Recall was 0.75, the Precision was 0.75, and the F1 score was 0.75. However, question words, interjections and ideophones received 0.0, highlighting the need for refinement of the STAT POS tagger.
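
The per-tag metrics reported above follow the standard definitions: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2PR / (P + R). A minimal Python 3 sketch of such a per-tag evaluation is shown below; the tag names and toy gold/predicted sequences are illustrative only, not the STAT evaluation data.

```python
from collections import Counter


def per_tag_prf(gold, predicted):
    """Compute per-tag Precision, Recall and F1 from parallel tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # tagger predicted p where the gold tag was something else
            fn[g] += 1  # gold tag g was missed
    scores = {}
    for tag in set(gold) | set(predicted):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[tag] = (prec, rec, f1)
    return scores


# Illustrative toy sequences, not the STAT POS tagger output.
gold = ["N", "V", "N", "ADV", "V", "N"]
pred = ["N", "V", "ADV", "ADV", "V", "V"]
for tag, (p, r, f) in sorted(per_tag_prf(gold, pred).items()):
    print(f"{tag}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```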

  •  

Ideational Analysis and Integration of African Folktale in Science, Technology, and Education

Folktales are literary forms that reveal the soul of any society; they express its wishes, desires, hopes, and beliefs about the world. They feature fictional characters and situations and were mostly oral traditions before they were written down. According to Cynthia McDaniel (1993), folktales can be used in all disciplines to convey knowledge and communicate ideas; they serve as an inherent vehicle for intergenerational communication that prepares and assigns roles and responsibilities to different generations in their communities. They are pedagogic devices more than literary pieces. They cultivate universal values such as compassion, generosity, and honesty while disapproving of attributes such as cruelty, greed, and dishonesty. To illustrate McDaniel's claims, this paper will firstly use the ideational metafunctional framework found in Systemic Functional Linguistics, which expresses the clausal experiences and content from a grammatical perspective, coupled with syntagmatic analysis, which describes the text (folktale) in chronological order as reported by the storyteller. Secondly, the paper will use a textual metafunctional framework that fulfills the thematic function of the clause, coupled with paradigmatic analysis, where the folkloristic text's patterns are regrouped more analytically to reveal the text's latent content, or theme. The Voyant Tool, a web-based text reading and analysis environment designed to facilitate the analysis of various text formats, was used to extract and analyze data from a Sesotho folktale to illustrate how folktales may be integrated with technology for research and educational purposes. This paper employed a descriptive research design that incorporates qualitative (content analysis) and quantitative (statistical analysis) methodologies to analyze and interpret the story. Through the Voyant tool, it is observed that the story is built out of 191 Sesotho word formations; through the ideational analysis, that the storyteller employed more material process types than mental process types; and through the textual interpretation, that oral literature holds value in our daily lives and that folktales may play a significant role in interpreting sociopolitical events in contemporary communities.
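
A minimal sketch of the kind of surface statistics Voyant reports (token count, distinct word formations, most frequent words). The sample string is a placeholder, not the Sesotho folktale analysed in the paper, whose count of 191 word formations comes from the actual text.

```python
import re
from collections import Counter


def word_form_stats(text):
    """Return token count, distinct word-form count and the top frequencies."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    freqs = Counter(tokens)
    return len(tokens), len(freqs), freqs.most_common(10)


# Placeholder text for illustration only.
sample = "the quick brown fox jumps over the lazy dog and the fox runs"
n_tokens, n_forms, top = word_form_stats(sample)
print(n_tokens, n_forms, top)
```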

  •  

Building Corpora for Low-Resource Kenyan Languages

Natural Language Processing is a crucial frontier in artificial intelligence, with broad application across public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This article presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw’ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording and transcribing conversations and translating the resulting text into Kiswahili, creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible on open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thereby facilitating ongoing contributions and developer access to train models and develop Natural Language Processing applications.
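
A minimal sketch of step (1), pairing transcriptions with their Kiswahili translations into a simple two-column parallel corpus. The file names and the assumption that the two files are line-aligned are hypothetical, not the project's actual data layout.

```python
import csv
from pathlib import Path

# Hypothetical inputs: one transcribed sentence and one Kiswahili translation per line.
transcriptions = Path("dholuo_transcripts.txt").read_text(encoding="utf-8").splitlines()
translations = Path("kiswahili_translations.txt").read_text(encoding="utf-8").splitlines()

assert len(transcriptions) == len(translations), "files must be line-aligned"

# Write a two-column parallel corpus (source sentence, Kiswahili translation).
with open("dholuo_kiswahili_parallel.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["dholuo", "kiswahili"])
    for src, tgt in zip(transcriptions, translations):
        if src.strip() and tgt.strip():  # drop empty or unpaired lines
            writer.writerow([src.strip(), tgt.strip()])
```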

  •  

Stop words in Khoekhoe

Stop word lists are useful resources that allow for the filtering of words in texts that typically do not carry (much) content. Filtering stop words can improve the efficiency and accuracy of data processing. Stop words are typically short and occur very frequently in texts. Stop word lists are language dependent, and many low-resource languages currently do not have (accurate) stop word lists. In this article, we look at how we can create, based on word frequency, a stop word list for Khoekhoe, which is a low-resource language spoken in Southern Africa. Given that stop words do not carry much content, they can be expected to occur consistently across different texts. We compare lists of the most frequent words across texts in different genres and examine which words feature in these lists consistently. We look at the overlap of frequent words in English texts, compare these to a known English stop word list, and then compare the results with the overlap of frequent words in Khoekhoe texts. The results show that there is a high overlap between genres for English, but the overlap between the Khoekhoe genres is lower. This may be due to a different typological profile of Khoekhoe. This means that creating a stop word list for Khoekhoe is more complicated and most likely requires other techniques to produce a useful stop word list.
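
A minimal sketch of the frequency-and-overlap idea described above: take the most frequent word forms per genre and treat the words shared across genres as stop word candidates. The file names and the top-50 cutoff are assumptions for illustration, not the paper's actual corpora or threshold.

```python
import re
from collections import Counter


def top_words(text, n=50):
    """Return the n most frequent word forms in a text."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return {w for w, _ in Counter(tokens).most_common(n)}


def overlap(a, b):
    """Jaccard overlap between two sets of frequent words."""
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical genre files; in practice these would be Khoekhoe texts of different genres.
genre_a = open("khoekhoe_genre_a.txt", encoding="utf-8").read()
genre_b = open("khoekhoe_genre_b.txt", encoding="utf-8").read()

freq_a, freq_b = top_words(genre_a), top_words(genre_b)
print("cross-genre overlap:", overlap(freq_a, freq_b))
print("stop word candidates:", sorted(freq_a & freq_b))
```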

  •  

Creating Bilingual Corpora for isiZulu: A Case Study from the University of KwaZulu-Natal

Although several bilingual resources exist, there is a lack of domain-specific, institutionally verified parallel corpora focusing on academic and administrative texts. Existing datasets such as the Autshumato English–isiZulu corpus, the UNISA English/Zulu Parallel Corpus, and the WebCrawl African Corpus hosted on GitHub provide valuable material but differ in accessibility, domain coverage, and documentation. To complement these initiatives, the University Language Planning and Development Office (ULPDO) at the University of KwaZulu-Natal has developed a curated isiZulu–English Parallel Corpus comprising 10,000 carefully aligned sentence pairs drawn from institutional and academic texts. This paper outlines the corpus compilation process, including data sourcing, cleaning, alignment, and validation, and discusses key structural and linguistic challenges encountered. The resource contributes to translation studies, terminology development, and multilingual natural language processing, while supporting ongoing efforts to advance the digital presence and intellectualisation of isiZulu.
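
One common validation step for aligned sentence pairs is a simple length-ratio sanity check before manual review. The sketch below illustrates that idea only; the file names and the 0.3-3.0 band are hypothetical, not the ULPDO validation criteria.

```python
from pathlib import Path

# Hypothetical line-aligned files: one isiZulu sentence per line and its English counterpart.
zul = Path("ulpdo_zul.txt").read_text(encoding="utf-8").splitlines()
eng = Path("ulpdo_eng.txt").read_text(encoding="utf-8").splitlines()

kept, flagged = [], []
for z, e in zip(zul, eng):
    z, e = z.strip(), e.strip()
    if not z or not e:
        flagged.append((z, e))  # empty side: likely an alignment gap
        continue
    ratio = len(z) / len(e)
    # isiZulu's conjunctive orthography tends to yield shorter sentences in characters;
    # this band is an arbitrary sanity check for illustration only.
    if 0.3 <= ratio <= 3.0:
        kept.append((z, e))
    else:
        flagged.append((z, e))

print(f"{len(kept)} pairs kept, {len(flagged)} flagged for manual review")
```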

  •  

Traditional Readability Approaches in Sesotho and isiZulu

This paper presents a conceptual overview of traditional readability metrics adapted for two South African Indigenous languages, isiZulu and Sesotho, which differ orthographically with conjunctive and disjunctive writing systems, respectively. Both languages are low-resource, lacking extensive corpora, lexicons, and pretrained models necessary for automatic readability assessment. By critically examining these adaptations, we highlight the challenges of applying English-based metrics to morphologically complex African languages and emphasise the need for language-specific digital resources that reflect local linguistic structures. Our work aligns with ongoing efforts to develop and enhance language resources for under-resourced African Indigenous languages, thereby supporting their evolving presence and accessibility in the digital age, including contexts shaped by large language models.
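
For concreteness, below is a minimal sketch of one traditional metric, a Flesch-style reading-ease score with a naive vowel-cluster syllable counter. The English coefficients and the syllable heuristic are assumptions applied as-is, which is exactly the kind of mismatch the paper discusses for conjunctively written isiZulu versus disjunctively written Sesotho.

```python
import re


def count_syllables(word):
    """Naive syllable estimate: count vowel clusters (an assumption, not a
    language-specific syllabifier for isiZulu or Sesotho)."""
    return max(1, len(re.findall(r"[aeiou]+", word.lower())))


def flesch_reading_ease(text):
    """Classic Flesch formula with its original English coefficients."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text, flags=re.UNICODE)
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / len(sentences)  # average sentence length in words
    asw = syllables / len(words)       # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


# A conjunctively written sentence packs many morphemes into few long words,
# inflating syllables-per-word and deflating sentence length, so the raw score
# is not comparable across orthographies without adaptation.
print(flesch_reading_ease("Ngiyakuthanda ukufunda izincwadi."))
```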

  •  

An exploration of the computational identification of English loan words in Sesotho

South Africa, with its twelve official languages, is an inherently multilingual country. As such, speakers of many of the languages have been in direct contact. This has led to a cross-over of words and phrases between languages. In this article, we provide a methodology to identify words that are (potentially) borrowed from another language. We test our approach by trying to identify words that moved from English into Sesotho (or potentially the other way around). To do this, we start with a bilingual Sesotho-English dictionary (Bukantswe). We then develop a lexicographic comparison method that takes a pair of lexical items (English and Sesotho) and computes a range of distance metrics. These distance metrics are applied to the raw words (i.e., comparing orthography), but using the Soundex algorithm, an approximate phonological comparison can be made as well. Unfortunately, Bukantswe does not contain complete annotation of loan words, so a quantitative evaluation is not currently possible. We provide a qualitative analysis of the results, which shows that many loan words can be found, but in some cases lexical items that have a high similarity are not loan words. We discuss different situations related to the influence of orthography, phonology, syllable structure, and morphology. The approach itself is language independent, so it can also be applied to other language pairs, e.g., Afrikaans and Sesotho, or more related languages, such as isiXhosa and isiZulu.
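
A minimal sketch of the two comparison signals named above: an orthographic similarity ratio over the raw word pair and a classic (English-oriented) Soundex code as a rough phonological key. The similarity measure and the example pairs are illustrative choices, not the specific distance metrics or Bukantswe entries used in the paper.

```python
import difflib

SOUNDEX_MAP = {c: d for d, letters in
               {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}.items()
               for c in letters}


def soundex(word):
    """Classic Soundex code (retain first letter, encode consonants, drop vowels)."""
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return "0000"
    first = word[0].upper()
    digits = []
    prev = SOUNDEX_MAP.get(word[0], "")
    for c in word[1:]:
        code = SOUNDEX_MAP.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "hw":  # h and w do not reset the previous code
            prev = code
    return (first + "".join(digits) + "000")[:4]


def similarity(a, b):
    """Orthographic similarity ratio between two lexical items (0..1)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Illustrative Sesotho-English candidate pairs, not taken from Bukantswe.
pairs = [("sekolo", "school"), ("buka", "book"), ("tafole", "table")]
for st, en in pairs:
    print(st, en,
          f"orth={similarity(st, en):.2f}",
          f"soundex={'match' if soundex(st) == soundex(en) else 'differ'}")
```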

  •  

SeSoDa: A Compact Context-Rich Sesotho-English Dataset for LoRA Fine-Tuning of SLMs

We introduce SeSoDa, a multidomain Sesotho (sa Lesotho)-English dataset of 1,966 prompt-completion pairs that span six categories (nouns, verbs, idioms, quantifiers, grammar rules, usage alerts). SeSoDa documents the morphosyntactic complexity, uncaptured Basotho cultural specificity, and orthographic/phonological differences between Lesotho and South African Sesotho. We created a user-friendly, JSON-style corpus with detailed metadata. This aims to lower the technical barrier for new researchers in Lesotho, helping them advance culture-aware machine translation, linguistic analysis, and cultural preservation using AI. As a proof of concept, we demonstrate SeSoDa’s utility by fine-tuning the TinyLlama-1.1B-Chat model using Low-Rank Adaptation (LoRA) entirely on free Google Colab GPUs, within their runtime limits. This parameter-efficient fine-tuning approach is particularly vital for resource-constrained environments like Lesotho, making advanced NLP model adaptation feasible and accessible without requiring extensive computational resources. We open-source the code for the dataset creation, the baseline model, and the dataset itself. We hope to see both Basotho researchers and developers build on top of our effort.
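
A minimal sketch of LoRA adaptation with the Hugging Face PEFT library, of the kind the abstract describes. The checkpoint name, adapter hyperparameters, and the prompt-completion example are assumptions for illustration, not the SeSoDa code or data.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed Hub checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapters on the attention projections; rank/alpha/dropout are placeholders,
# not the hyperparameters used for SeSoDa.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights is trainable

# An illustrative prompt-completion pair (not from the dataset), consumed the way a
# causal-LM fine-tuning loop would: one forward pass with the tokens as labels.
example = {"prompt": "Translate to Sesotho: Hello.", "completion": " Lumela."}
batch = tokenizer(example["prompt"] + example["completion"], return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # a Trainer/SFT loop would iterate this
print(float(loss))
```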

  •  

Using the isiZulu GF Resource Grammar for morphological annotation

The isiZulu GF Resource Grammar (ZRG) enables syntactic parsing using the GF runtime system. In order to perform this task, the ZRG implicitly encodes rich morphosyntactic information about isiZulu. In this paper we show how such information can be made explicit by adapting the way the grammar linearises GF abstract syntax trees. The result is annotated text, which can be utilised in various ways for supporting natural language processing of an under-resourced, morphologically complex language like isiZulu.
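
Below is a minimal sketch of driving a compiled GF grammar from Python via the pgf runtime bindings to parse and re-linearise a sentence. The .pgf file name, the concrete-syntax name, and the example sentence are placeholders; the annotation-emitting linearisation itself lives in the adapted ZRG grammar described in the paper, not in this script.

```python
import pgf  # Python bindings for the GF runtime

# Placeholder artefact and concrete-syntax names; the actual ZRG build may differ.
grammar = pgf.readPGF("Zul.pgf")
zul = grammar.languages["ZulZul"]

sentence = "abafana bayagijima"  # illustrative isiZulu input
# Parse to an abstract syntax tree, then linearise it again; with the adapted
# linearisation rules, the output would carry explicit morphosyntactic annotations
# rather than plain surface text.
prob, tree = next(zul.parse(sentence))
print(tree)
print(zul.linearize(tree))
```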

  •  

Multilingual Data from the Agricultural Domain: Presenting the NWU-Pula/Imvula Corpora

This paper presents new multilingual corpora from the agricultural domain for seven South African languages, namely Afrikaans, English, isiXhosa, isiZulu, Sesotho, Sesotho sa Leboa, and Setswana, based on the Pula/Imvula magazine. After pre-processing, the data has been automatically sentencized, tokenized, lemmatized and annotated with part-of-speech information using the services available at https://v-ctx-lnx7.nwu.ac.za/. The final resources, comprising between 774k and 1.38M tokens per language, are included on the Corpus Cooperative at North-West University (COCO@NWU) corpus platform at https://coco.nwu.ac.za/ as searchable corpora. In addition, the data can be made available as text files for research purposes upon request. To highlight the value of this agricultural domain-specific data collection in relation to more general data, we also include some corpus-based statistics and comparisons with previous research.
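
A minimal sketch of the kind of corpus-based comparison mentioned above: relative word-form frequencies in the domain corpus against a general corpus, to surface terms over-represented in the agricultural domain. The file names and the smoothing constant are placeholders; the Pula/Imvula corpora themselves are accessed via COCO@NWU.

```python
import re
from collections import Counter


def rel_freq(path):
    """Relative word-form frequencies (per million tokens) for a plain-text corpus."""
    tokens = re.findall(r"\w+", open(path, encoding="utf-8").read().lower(), flags=re.UNICODE)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total * 1_000_000 for w, c in counts.items()}, total


# Placeholder file names for one language's domain and general corpora.
domain, n_domain = rel_freq("pula_imvula_zul.txt")
general, n_general = rel_freq("general_zul.txt")

print(f"domain tokens: {n_domain}, general tokens: {n_general}")
# Words most over-represented in the agricultural domain relative to general text
# (0.5 per million is a crude smoothing value for unseen words).
overrep = sorted(domain, key=lambda w: domain[w] / general.get(w, 0.5), reverse=True)[:20]
print(overrep)
```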

  •