
Received before yesterday · Academic journals (overseas)

Refrigerator Wisdom: Social Rules and Rights as a Conceptual Framework for Digital Ethics

27 January 2026, 11:58

Is it possible that examining the rules which govern our use of an everyday appliance can assist data practitioners in collecting and using data in ethical and responsible ways? This inquiry is based on a theory that current and emerging frameworks for data ethics share a palpable synergy with social rules that govern our use of everyday technologies, like the household refrigerator. [...]

Data Supremacy: Race In-Formation Through Herman Hollerith’s Tabulating Machine

16 January 2026, 16:56

In this essay, I examine how data was racialized in the United States through the Tabulating Machine, developed by the German American inventor Herman Hollerith in the 1880s to automate census tabulation. During that time, scientific theories of race posited racial categories to be biologically distinct and hierarchized, thereby defining “whiteness” as a favoured trait. As such, the [...]

Preface to the Proceedings of RAIL 2025

The sixth workshop on Resources for African Indigenous Languages (RAIL) was held on 10 November 2025 at the CSIR International Convention Centre in Pretoria, South Africa. It was co-located with the Digital Humanities Association of Southern Africa (DHASA) 2025 conference, which took place from 11 to 14 November 2025.

Siswati Part of Speech Tagger: A Quantitative Evaluation

31 December 2025, 08:00

This article evaluates the performance of the Siswati Text Annotation Tool part of speech (STAT POS) tagger using Recall, Precision and F1 score metrics. A quantitative research design was adopted for analysis, and data was collected through purposive sampling. Python 3 was utilised to calculate the Recall and Precision of the STAT POS tagger outputs. The results show that the Recall for nouns was 0.761, Precision 0.417, with an F1 score of 0.54; for verbs, the Recall was 0.756, Precision 0.798 and F1 score 0.54; for adverbs, the Recall was 0.571, Precision 0.8, and F1 score 0.67; for possessives, the Recall was 0.963, Precision 0.813 and F1 score 0.88. For relatives (REL), the Recall was 0.706, Precision 0.523, and the F1 score 0.60; for class-indicating demonstratives, the Recall was 0.333, Precision 0.25, and the F1 score 0.29; and for copulatives (COP), the Recall was 0.75, Precision 0.75, and the F1 score 0.75. For conjunctions, the Recall was 0.85, the Precision was 0.68, and the F1 score was 0.76; for pronouns, the Recall was 0.563, the Precision was 1.0, and the F1 score was 0.72; for adjectives, the Recall was 0.75, the Precision was 0.75, and the F1 score was 0.75. However, question words, interjections and ideophones received scores of 0.0, highlighting the need for refinement of the STAT POS tagger.
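The F1 scores reported above follow from the harmonic mean of Precision and Recall. The sketch below recomputes a few of them from the figures in the abstract; the function name `f1_score` and the dictionary layout are illustrative assumptions, not the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs copied from the abstract above.
reported = {
    "nouns": (0.761, 0.417),
    "verbs": (0.756, 0.798),
    "possessives": (0.963, 0.813),
    "pronouns": (0.563, 1.0),
}

for tag, (recall, precision) in reported.items():
    print(f"{tag}: F1 = {f1_score(precision, recall):.2f}")
```

Recomputed this way, the noun, possessive and pronoun rows match the reported F1 scores. The verb row comes out at about 0.78 rather than the reported 0.54, so that figure may be a transcription error in the abstract.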

Ideational Analysis and Integration of African Folktale in Science, Technology, and Education

31 December 2025, 08:00

Folktales are literary forms that reveal the soul of any society; they express its wishes, desires, hopes, and beliefs about the world. They feature fictional characters and situations and were mostly oral traditions before they were written down. According to Cynthia McDaniel (1993), folktales can be used in all disciplines to convey knowledge and communicate ideas; they serve as an inherent vehicle for intergenerational communication that prepares and assigns roles and responsibilities to different generations in their communities. They are pedagogic devices more than literary pieces. They cultivate universal values such as compassion, generosity, and honesty while disapproving of attributes such as cruelty, greed, and dishonesty. To illustrate McDaniel's claims, this paper first uses the ideational metafunctional framework found in Systemic Functional Linguistics, which expresses the clausal experiences and content from a grammatical perspective, coupled with syntagmatic analysis, which describes the text (folktale) in chronological order as reported by the storyteller. Second, the paper uses a textual metafunctional framework that fulfills the thematic function of the clause, coupled with paradigmatic analysis, in which the folkloristic text's patterns are regrouped more analytically to reveal the text's latent content, or theme. The Voyant Tool, a web-based text reading and analysis environment designed to facilitate the analysis of various text formats, was used to extract and analyze data from a Sesotho folktale to illustrate how folktales may be integrated with technology for research and educational purposes. This paper employed a descriptive research design that incorporates qualitative (content analysis) and quantitative (statistical analysis) methodologies to analyze and interpret the story. The Voyant Tool shows that the story is built out of 191 Sesotho word formations, and the ideational analysis shows that the storyteller employed more material process types than mental process types. Lastly, the textual interpretation indicates the value of oral literature in our daily lives as well as the significant role folktales may play in interpreting sociopolitical events in contemporary communities.

Building Corpora for Low-Resource Kenyan Languages

31 December 2025, 08:00

Natural Language Processing is a crucial frontier in artificial intelligence, with broad application across public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This article presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw’ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording and transcribing conversations and translating the resulting text into Kiswahili, creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible on open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thereby facilitating ongoing contributions and developer access to train models and develop Natural Language Processing applications.

Stop words in Khoekhoe

31 December 2025, 08:00

Stop word lists are useful resources that allow for the filtering of words in texts that typically do not carry (much) content. Filtering stop words can improve the efficiency and accuracy of data processing. Stop words are typically short and occur very frequently in texts. Stop word lists are language dependent and many low-resource languages currently do not have (accurate) stop word lists. In this article, we look at how we can create, based on word frequency, a stop word list for Khoekhoe, which is a low-resource language spoken in Southern Africa. Given that stop words do not carry much content, they can be expected to occur consistently across different texts. We compare lists of the most frequent words across texts in different genres and identify which words feature in these lists consistently. We also look at the overlap of frequent words in English texts, compare these with a known English stop word list, and compare the results with the overlap of frequent words in Khoekhoe texts. The results show that there is a high overlap between genres for English, but the overlap between the Khoekhoe genres is lower. This may be due to the different typological profile of Khoekhoe. This means that creating a stop word list for Khoekhoe is more complicated and most likely requires other techniques to produce a useful stop word list.
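The frequency-based comparison described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the tokenisation rule, the cut-off `k`, the Jaccard overlap measure, the function names, and the toy English texts are all hypothetical.

```python
import re
from collections import Counter


def top_words(text: str, k: int) -> set[str]:
    """Return the k most frequent word types in a text (lower-cased)."""
    # [^\W\d_]+ matches runs of Unicode letters, so non-ASCII letters are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return {word for word, _ in Counter(tokens).most_common(k)}


def consistent_words(genre_texts: dict[str, str], k: int) -> set[str]:
    """Words that appear in the top-k list of every genre."""
    top_lists = [top_words(text, k) for text in genre_texts.values()]
    return set.intersection(*top_lists) if top_lists else set()


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two word sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy texts standing in for two genres (hypothetical data).
genres = {
    "news": "the cat and the dog and the bird",
    "fiction": "the king and the queen and the fool",
}
candidates = consistent_words(genres, k=2)
known_stop_words = {"the", "and", "of"}  # stand-in for a published list
print(sorted(candidates), jaccard(candidates, known_stop_words))
```

A low overlap between genres, as reported for Khoekhoe, would show up here as a small `consistent_words` set even for generous values of `k`.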

Creating Bilingual Corpora for isiZulu: A Case Study from the University of KwaZulu-Natal

31 December 2025, 08:00

Although several bilingual resources exist, there is a lack of domain-specific, institutionally verified parallel corpora focusing on academic and administrative texts. Existing datasets such as the Autshumato English–isiZulu corpus, the UNISA English/Zulu Parallel Corpus, and the WebCrawl African Corpus hosted on GitHub provide valuable material but differ in accessibility, domain coverage, and documentation. To complement these initiatives, the University Language Planning and Development Office (ULPDO) at the University of KwaZulu-Natal has developed a curated isiZulu–English Parallel Corpus comprising 10,000 carefully aligned sentence pairs drawn from institutional and academic texts. This paper outlines the corpus compilation process, including data sourcing, cleaning, alignment, and validation, and discusses key structural and linguistic challenges encountered. The resource contributes to translation studies, terminology development, and multilingual natural language processing, while supporting ongoing efforts to advance the digital presence and intellectualisation of isiZulu.

Traditional Readability Approaches in Sesotho and isiZulu

31 December 2025, 08:00

This paper presents a conceptual overview of traditional readability metrics adapted for two South African Indigenous languages, isiZulu and Sesotho, which differ orthographically with conjunctive and disjunctive writing systems, respectively. Both languages are low-resource, lacking extensive corpora, lexicons, and pretrained models necessary for automatic readability assessment. By critically examining these adaptations, we highlight the challenges of applying English-based metrics to morphologically complex African languages and emphasise the need for language-specific digital resources that reflect local linguistic structures. Our work aligns with ongoing efforts to develop and enhance language resources for under-resourced African Indigenous languages, thereby supporting their evolving presence and accessibility in the digital age, including contexts shaped by large language models.

An exploration of the computational identification of English loan words in Sesotho

31 December 2025, 08:00

South Africa, with its twelve official languages, is an inherently multilingual country. As such, speakers of many of the languages have been in direct contact. This has led to a cross-over of words and phrases between languages. In this article, we provide a methodology to identify words that are (potentially) borrowed from another language. We test our approach by trying to identify words that moved from English into Sesotho (or potentially the other way around). To do this, we start with a bilingual Sesotho-English dictionary (Bukantswe). We then develop a lexicographic comparison method that takes a pair of lexical items (English and Sesotho) and computes a range of distance metrics. These distance metrics are applied to the raw words (i.e., comparing orthography), but using the Soundex algorithm, an approximate phonological comparison can be made as well. Unfortunately, Bukantswe does not contain complete annotation of loan words, so a quantitative evaluation is not currently possible. We provide a qualitative analysis of the results, which shows that many loan words can be found, but in some cases lexical items that have a high similarity are not loan words. We discuss different situations related to the influence of orthography, phonology, syllable structure, and morphology. The approach itself is language independent, so it can also be applied to other language pairs, e.g., Afrikaans and Sesotho, or more related languages, such as isiXhosa and isiZulu.
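A pairwise comparison of this kind might look like the sketch below. The abstract does not say which distance metrics were used, so Levenshtein edit distance is an assumption here, as are the standard American Soundex variant, the function names, and the two word pairs, which are illustrative and not taken from Bukantswe.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


_SOUNDEX_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                   ("l", "4"), ("mn", "5"), ("r", "6"))
_SOUNDEX_CODES = {ch: digit for letters, digit in _SOUNDEX_GROUPS for ch in letters}


def soundex(word: str) -> str:
    """Standard American Soundex code: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated code is kept after them
    return (code + "000")[:4]


def compare(english: str, sesotho: str) -> dict:
    """Orthographic and approximate phonological comparison of one word pair."""
    distance = levenshtein(english.lower(), sesotho.lower())
    longest = max(len(english), len(sesotho), 1)
    return {
        "edit_distance": distance,
        "similarity": 1 - distance / longest,
        "soundex_match": soundex(english) == soundex(sesotho),
    }


for pair in [("book", "buka"), ("school", "sekolo")]:  # illustrative pairs
    print(pair, compare(*pair))
```

Soundex was designed around English spelling, so a match or mismatch is only a rough signal; this fits the abstract's observation that high similarity does not always indicate a loan word.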

SeSoDa: A Compact Context-Rich Sesotho-English Dataset for LoRA Fine-Tuning of SLMs

31 December 2025, 08:00

We introduce SeSoDa, a multidomain Sesotho (Sa Lesotho)-English dataset of 1,966 prompt-completion pairs that span six categories (nouns, verbs, idioms, quantifiers, grammar rules, usage alerts). SeSoDa documents the morphosyntactic complexity, uncaptured Basotho cultural specificity, and orthographic/phonological differences between Lesotho and South African Sesotho. We created a user-friendly, JSON-style corpus with detailed metadata. This aims to lower the technical barrier for new researchers in Lesotho, helping them advance culture-aware machine translation, linguistic analysis, and cultural preservation using AI. As a proof of concept, we demonstrate SeSoDa’s utility by fine-tuning the TinyLlama-1.1B-Chat model using Low-Rank Adaptation (LoRA) entirely within the GPU and runtime limits of the free Google Colab tier. This parameter-efficient fine-tuning approach is particularly vital for resource-constrained environments like Lesotho, making advanced NLP model adaptation feasible and accessible without requiring extensive computational resources. We open-source the code for the dataset creation, the baseline model, and the dataset itself. We hope to see both Basotho researchers and developers build on top of our effort.

Using the isiZulu GF Resource Grammar for morphological annotation

31 December 2025, 08:00

The isiZulu GF Resource Grammar (ZRG) enables syntactic parsing using the GF runtime system. In order to perform this task, the ZRG implicitly encodes rich morphosyntactic information about isiZulu. In this paper we show how such information can be made explicit by adapting the way the grammar linearises GF abstract syntax trees. The result is annotated text, which can be utilised in various ways for supporting natural language processing of an under-resourced, morphologically complex language like isiZulu.

Multilingual Data from the Agricultural Domain: Presenting the NWU-Pula/Imvula Corpora

31 December 2025, 08:00

This paper presents new multilingual corpora from the agricultural domain for seven South African languages, namely Afrikaans, English, isiXhosa, isiZulu, Sesotho, Sesotho sa Leboa, and Setswana, based on the Pula/Imvula magazine. After pre-processing, the data was automatically split into sentences, tokenized, lemmatized and annotated with part-of-speech information using the services available at https://v-ctx-lnx7.nwu.ac.za/. The final resources, comprising between 774k and 1.38M tokens per language, are included on the Corpus Cooperative at North-West University (COCO@NWU) corpus platform at https://coco.nwu.ac.za/ as searchable corpora. In addition, the data can be made available as text files for research purposes upon request. To highlight the value of this agricultural domain-specific data collection in relation to more general data, we also include some corpus-based statistics and comparisons with previous research.

Bootstrapping Siswati lexical resources from isiZulu

31 December 2025, 08:00

IsiZulu and Siswati are closely related languages that share significant morphosyntactic characteristics. Systematic differences between these languages have been identified at the phonological and morphosyntactic levels. Due to the resource-scarce status of these languages, this similarity has led to bootstrapping of computational language resources at the morphological and syntactic levels. In this work, we investigate the feasibility of adapting lexical items in a computational lexicon from isiZulu to Siswati. We use Grammatical Framework resource grammars for both languages to analyse and transform lexical items, which are then evaluated against a parallel term list. An iterative process yields a success rate of 70.5%, indicating that this approach is largely viable as a means of significantly reducing the manual effort needed to develop lexicons for computational resources for Siswati.

Multimodal Classification System for Hausa Using LLMs and Vision Transformers

31 December 2025, 08:00

This paper presents a classification-based Visual Question Answering (VQA) system for the Hausa language, integrating Large Language Models (LLMs) and vision transformers. By fine-tuning LLMs on monolingual Hausa text and fusing their representations with those of state-of-the-art vision encoders, our system predicts answers from a fixed vocabulary. Experiments conducted on the HaVQA dataset, under offline text–image augmentation regimes tailored to the specificity of Hausa as a low-resource language, show that this augmentation strategy yields the best performance over the baseline, achieving 35.85% accuracy, 35.89% Wu-Palmer similarity, and 15.32% F1-score.

Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP

The critical lack of structured terminological data for South Africa’s official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Mafoko addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Mafoko dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Mafoko provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s rich linguistic diversity is represented in the digital age.

Compiling Specialised Glossaries: A Case Study of English-isiNdebele Medical Terms

31 December 2025, 08:00

The inability to access health care because of a language barrier is a matter of concern, which this study seeks to address. There is a critical need for accessible health care communication among amaNdebele, an ethnic group who speak Southern Ndebele, known as isiNdebele, one of the official languages of South Africa. IsiNdebele belongs to the Nguni family, alongside isiZulu, isiXhosa and Siswati. The study examines the compilation of a specialised English-isiNdebele glossary of medical terms. It investigates the methodologies and challenges involved in the creation of bilingual medical terminologies for English and isiNdebele. A corpus-based approach, together with prescriptive and descriptive lexicographic methods, was employed. Medical corpora were compiled, and the isiNdebele-speaking medical community was consulted so that culturally appropriate medical terms could be developed. The findings of this study reveal notable gaps in the medical terminology of isiNdebele. The findings further emphasise the significance of term formation that involves both the medical community (nurses and doctors) and isiNdebele language experts. This study contributes to the broader discourse on equity in healthcare. It also contributes to the intellectualisation of African languages in post-apartheid South Africa.
