Received yesterday (23 May 2026): Academic journals (overseas)
Received before yesterday: Academic journals (overseas)

Infrastructures of Listening: The ManoWhisper Podcast Analysis Pipeline

ManoWhisper is an end-to-end research pipeline for collecting, transcribing, and analyzing hateful and misogynistic podcast content, built to support peer-reviewed and policy-facing research on gender-based extremism. This paper argues that the tool reframes harmful media as a site of feminist methodological inquiry, with implications for understanding how such content spreads across platforms and into AI training data.

Digital bioethics: exploring an emerging field

Med Health Care Philos. 2026 Apr 16. doi: 10.1007/s11019-026-10347-1. Online ahead of print.

ABSTRACT

The uptake of social science methods by bioethics significantly expanded its methodological spectrum, raising new theoretical, methodological, and practical questions. Recently, we have been witnessing another trend: the addition of advanced data science methods to bioethics' toolkit to aid, for example, in online data analysis, support scholarly writing, and inform clinical ethics. This article explores the emerging field of Digital Bioethics across its dimensions by analysing the tangled relationship between topics and methods, highlighting intersections between Digital Bioethics and Bioethics of the Digital, and advocating for a methods-based definition of the field. The use of advanced data science methods within bioethics must be interpreted in the context of the use of Artificial Intelligence (AI) in health care. At the same time, it presents unique opportunities and challenges. Defining, and thus demarcating, Digital Bioethics can create support for the new field but also requires navigating trade-offs. To do so, we take four kindred academic fields as points of comparison (Digital Humanities, Experimental Philosophical Bioethics, computational medicine and digitised biology) to analyse what each of them can teach us about critically assessing and further developing Digital Bioethics. The article discusses potential pitfalls and concludes with recommendations on how the field can fully develop its potential to promote bioethical research and argument. Furthermore, the article discusses how critical reflection on the use of AI methods within bioethics itself will also contribute to the ethical oversight of increasingly AI-driven branches of healthcare.

PMID:41989660 | DOI:10.1007/s11019-026-10347-1

Introduction: Reading Code in the Age of AI

In their introduction to the second DHQ special issue on Critical Code Studies, Jeremy Douglass and Mark C. Marino survey recent developments in the field and argue that generative AI makes critical code reading more urgent, not less.

Do all politicians sound the same? Comparing model explanations to human responses

It is a commonly held belief that all politicians sound the same, but do they really? We combine machine learning and the model explainability method SHAP with human judgements to measure how distinct plenary speeches in the Finnish Parliament truly are and which features make them so.

Facets of Friction: Investigating epistemological friction between computing and the humanities to support Digital Humanities computing education

21 February 2026, 08:00
This article argues that the consideration of epistemological friction is essential to DH computing. It proposes a framework outlining common sites of friction with the intention that it be used as a pedagogical scaffold.

Decoding and Encoding Welsh Manuscript Culture: Scribes, Scripts and TEI

A detailed case study describing the conversion of Daniel Huws’ seminal Repertory of Welsh Manuscripts and Scribes into a dataset for analysis and publication, giving an insight into common issues in the conversion of printed texts into datasets, and a glimpse of how the dataset is giving us new insights into Welsh manuscript culture.

Generated cultural heritage question-answer dataset: Durga in multi-dimensional perspectives

Data Brief. 2026 Jan 20;65:112495. doi: 10.1016/j.dib.2026.112495. eCollection 2026 Apr.

ABSTRACT

This dataset presents a compilation of question-answer (QA) pairs derived from cultural texts and sources related to Durga mythology. It comprises a total of 21,395 QA pairs drawn from textual materials such as scriptures, ritual narratives, temple inscriptions, and traditional storytelling records. Each entry includes the source reference, question, and corresponding answer, provided in a structured format compatible with Excel for seamless integration into downstream natural language processing (NLP) tasks. Data collection involved manual curation and annotation by domain experts, followed by preprocessing steps including text normalization, duplicate removal, and verification of factual and contextual accuracy. The dataset is designed to support generative QA models, culturally aware chatbots, and digital preservation of heritage knowledge. It is particularly valuable for research in AI-driven cultural applications, educational tools, and digital humanities initiatives aiming to bridge traditional knowledge with computational methods. Researchers and practitioners may use the dataset for training generative models, creating interactive educational platforms, developing culturally sensitive AI agents, and supporting comparative studies in cross-cultural heritage. This openly accessible resource adheres to ethical standards, with proper attribution to source materials, and provides a foundational asset for both academic research and applied development in culturally informed artificial intelligence.
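The entry structure and two of the preprocessing steps named above (text normalization and duplicate removal) can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline: the class name, the choice of Unicode NFKC normalization, and the case-insensitive duplicate key are all assumptions, and the sample strings in the usage note are placeholders rather than rows from the dataset.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class QAEntry:
    """One record: source reference, question, answer (names are assumed)."""
    source: str
    question: str
    answer: str


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace (assumed steps)."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def deduplicate(entries):
    """Keep the first entry for each normalized, case-folded (question, answer) pair."""
    seen = set()
    unique = []
    for entry in entries:
        question = normalize(entry.question)
        answer = normalize(entry.answer)
        key = (question.casefold(), answer.casefold())
        if key in seen:
            continue
        seen.add(key)
        unique.append(QAEntry(normalize(entry.source), question, answer))
    return unique
```

Calling `deduplicate` on a list that contains `QAEntry("Source A", "What  is X?", "Y")` and `QAEntry("Source B", "what is x?", "y ")` keeps only the first of the two, since they differ only in spacing and letter case.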

PMID:41657412 | PMC:PMC12874138 | DOI:10.1016/j.dib.2026.112495

A georeferenced dataset of archaeobotanical findings of Olea europaea and Vitis vinifera compiled from published records from Central Italy

2 February 2026, 19:00

Data Brief. 2026 Jan 7;64:112443. doi: 10.1016/j.dib.2025.112443. eCollection 2026 Feb.

ABSTRACT

Here we present a coherent, georeferenced and chronologically qualified corpus of fossil plant remains compiled from published archaeobotanical records from archaeological sites in Central Italy, focused on Olea europaea (olive) and Vitis vinifera (grape). The dataset is entirely based on secondary data and does not include newly generated primary archaeobotanical analyses. The dataset integrates site, context and all relevant archaeobotanical occurrences within a coherent relational and spatial model. The corpus was initiated through a structured bibliographic survey aided by the BRAIN database. Only published literature was consulted, which made it possible to model archaeological sites and link them to excavation contexts and individual archaeobotanical occurrences (defined as the combination of a taxon and the specific plant part recovered, e.g., fruit, seed, rachis). The geodatabase was implemented using QGIS, with a local backend in GeoPackage, then migrated to PostgreSQL/PostGIS to support complex spatial/relational queries and future online outputs. All entities have a defined spatial placement accompanied by explicit quality-control parameters documenting positional uncertainty, source type and authority, as derived from the original published sources, ensuring transparent assessment of locational reliability. To enrich taxonomic information, an automated open thesaurus was built from CC BY/CC BY-SA resources (Floritaly, Acta Plantarum, and Wikimedia projects). The workflow employs REST-style access (or form-equivalent submissions), conservative rate-limiting, randomized waits, retries, and checkpoints; provenance and attribution (including noted transformations) are preserved. A standardized chronological table harmonizes relative cultural phases using ICCD nomenclature, with controlled fallbacks to Perio.do or peer-reviewed literature; a self-referential hierarchy (parent_id) ensures inheritance from sub-phase to broader period.
Crucially, the use of open licenses, stable identifiers and cross-references makes the dataset interoperable and interlinked with the source ecosystems from which the secondary archaeobotanical data were extracted: records can resolve back to Floritaly and Acta Plantarum, and our forthcoming web portal can expose these connections for bidirectional navigation, automated updating and external reuse. The result is an interoperable, verifiable resource suitable for spatial and temporal analyses of plant remains based on aggregated and standardized published archaeobotanical data, while remaining legally reusable under the original licenses.
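The self-referential chronological hierarchy described in this abstract can be illustrated with a small sketch. Only the `parent_id` column name comes from the abstract; the table name, the other columns, and the phase labels are hypothetical placeholders, and SQLite stands in for the PostgreSQL/PostGIS backend so that the example runs without a server. A recursive query walks from a sub-phase up to its broader period, which is one way the inheritance could be resolved.

```python
import sqlite3

# In-memory database with a hypothetical chronology table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chronology (
    id        INTEGER PRIMARY KEY,
    label     TEXT NOT NULL,
    parent_id INTEGER REFERENCES chronology(id)  -- NULL for a top-level period
);
INSERT INTO chronology (id, label, parent_id) VALUES
    (1, 'Period A',      NULL),
    (2, 'Phase A1',      1),
    (3, 'Sub-phase A1a', 2);
""")


def lineage(connection, phase_id):
    """Return the labels from the given phase up to its top-level period."""
    rows = connection.execute(
        """
        WITH RECURSIVE chain(id, label, parent_id, depth) AS (
            SELECT id, label, parent_id, 0
            FROM chronology
            WHERE id = ?
            UNION ALL
            SELECT c.id, c.label, c.parent_id, chain.depth + 1
            FROM chronology AS c
            JOIN chain ON c.id = chain.parent_id
        )
        SELECT label FROM chain ORDER BY depth
        """,
        (phase_id,),
    ).fetchall()
    return [label for (label,) in rows]


if __name__ == "__main__":
    print(lineage(conn, 3))
```

With this structure, an occurrence dated only to a sub-phase can still be found by a query on the broader period, because every ancestor is reachable through `parent_id`.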

PMID:41624435 | PMC:PMC12855569 | DOI:10.1016/j.dib.2025.112443

Preface to the Proceedings of RAIL 2025

The sixth workshop on Resources for African Indigenous Languages (RAIL) was held on 10 November 2025 at the CSIR International Convention Centre in Pretoria, South Africa. It was co-located with the Digital Humanities Association of Southern Africa (DHASA) 2025 conference, which took place from 11 to 14 November 2025.

Ideational Analysis and Integration of African Folktale in Science, Technology, and Education

31 December 2025, 08:00

Folktales are literary forms that reveal the soul of any society; they express its wishes, desires, hopes, and beliefs about the world. They feature fictional characters and situations and were mostly transmitted orally before being written down. According to Cynthia McDaniel (1993), folktales can be used in all disciplines to convey knowledge and communicate ideas; they serve as an inherent vehicle for intergenerational communication that prepares and assigns roles and responsibilities to different generations in their communities. They are pedagogic devices more than literary pieces. They cultivate universal values such as compassion, generosity, and honesty while disapproving of attributes such as cruelty, greed, and dishonesty. To illustrate McDaniel's claims, this paper first uses the ideational metafunctional framework found in Systemic Functional Linguistics, which expresses the clausal experiences and content from a grammatical perspective, coupled with syntagmatic analysis, which describes the text (folktale) in chronological order as reported by the storyteller. Second, it uses a textual metafunctional framework that fulfills the thematic function of the clause, coupled with paradigmatic analysis, in which the folkloristic text's patterns are regrouped more analytically to reveal the text's latent content, or theme. The Voyant Tool, a web-based text reading and analysis environment designed to facilitate the analysis of various text formats, was used to extract and analyze data from a Sesotho folktale to illustrate how folktales may be integrated with technology for research and educational purposes. This paper employed a descriptive research design that incorporates qualitative (content analysis) and quantitative (statistical analysis) methodologies to analyze and interpret the story.
It is observed, through the Voyant tool, that the story is built out of 191 Sesotho word formations; through the ideational analysis, that the storyteller employed more material process types than mental process types; and lastly, through the textual interpretation, that oral literature has value in our daily lives and that folktales may play a significant role in interpreting sociopolitical events in contemporary communities.

An exploration of the computational identification of English loan words in Sesotho

31 December 2025, 08:00

South Africa, with its twelve official languages, is an inherently multilingual country. As such, speakers of many of the languages have been in direct contact. This has led to a cross-over of words and phrases between languages. In this article, we provide a methodology to identify words that are (potentially) borrowed from another language. We test our approach by trying to identify words that moved from English into Sesotho (or potentially the other way around). To do this, we start with a bilingual Sesotho-English dictionary (Bukantswe). We then develop a lexicographic comparison method that takes a pair of lexical items (English and Sesotho) and computes a range of distance metrics. These distance metrics are applied to the raw words (i.e., comparing orthography), but using the Soundex algorithm, an approximate phonological comparison can be made as well. Unfortunately, Bukantswe does not contain complete annotation of loan words, so a quantitative evaluation is not currently possible. We provide a qualitative analysis of the results, which shows that many loan words can be found, but in some cases lexical items that have a high similarity are not loan words. We discuss different situations related to the influence of orthography, phonology, syllable structure, and morphology. The approach itself is language independent, so it can also be applied to other language pairs, e.g., Afrikaans and Sesotho, or more closely related languages, such as isiXhosa and isiZulu.
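The kind of pairwise comparison described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say which distance metrics were used, so Levenshtein edit distance is an assumption, the Soundex variant shown is the standard American one, and the function names are hypothetical. The word pair in the usage note is an illustrative example and is not taken from Bukantswe.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with the same code
            last = digit
    return (code + "000")[:4]


def compare(english: str, sesotho: str) -> dict:
    """Orthographic and approximate phonological comparison of one word pair."""
    e, s = english.lower(), sesotho.lower()
    distance = levenshtein(e, s)
    longest = max(len(e), len(s)) or 1
    return {
        "edit_distance": distance,
        "orth_similarity": 1 - distance / longest,
        "soundex_match": soundex(e) == soundex(s),
    }


if __name__ == "__main__":
    print(compare("school", "sekolo"))
```

Because Soundex was designed around English spelling, a mismatch of codes does not rule out a loan relationship; the abstract's own caveat about orthography, syllable structure and morphology applies to any threshold placed on scores like these.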

Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP

The critical lack of structured terminological data for South Africa’s official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Mafoko addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Mafoko dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Mafoko provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s rich linguistic diversity is represented in the digital age.
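The idea of feeding terminology into a retrieval-augmented prompt can be sketched as follows. The abstract does not describe Mafoko's retrieval step or prompt format, so everything here is an assumption made for illustration: exact phrase matching stands in for the retriever, the prompt wording is invented, no language model is called, and the glossary values are placeholders rather than real Tshivenda terms.

```python
import re

# Hypothetical glossary; the values are placeholders, not real translations.
GLOSSARY = {
    "budget vote": "TSHIVENDA_TERM_1",
    "municipality": "TSHIVENDA_TERM_2",
}


def retrieve_terms(sentence: str, glossary: dict) -> dict:
    """Return glossary entries whose English term occurs in the sentence."""
    lowered = sentence.lower()
    found = {}
    for source_term, target_term in glossary.items():
        pattern = r"\b" + re.escape(source_term.lower()) + r"\b"
        if re.search(pattern, lowered):
            found[source_term] = target_term
    return found


def build_prompt(sentence: str, glossary: dict) -> str:
    """Assemble a translation prompt that lists any retrieved terminology."""
    terms = retrieve_terms(sentence, glossary)
    lines = ["Translate the following English sentence into Tshivenda."]
    if terms:
        lines.append("Use these approved term translations:")
        lines.extend(f"- {src} -> {tgt}" for src, tgt in sorted(terms.items()))
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("The municipality tabled its report.", GLOSSARY))
```

A production retriever would more likely need lemmatization or embedding-based matching to catch inflected or paraphrased terms; exact matching is used here only to keep the sketch self-contained.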

Following Digital Footprints: Researching South African Digital Poetry

31 December 2025, 08:00

Contemporary scholarship increasingly recognises the need to document the growing corpus of African literature being produced and distributed via social media and other online platforms. In African literature and the future, Ogundipe (2015) declared that: In the search for a viable path for the future of African literature, a well-crafted vision of the future and effective strategies to engender transformation are imperative. This raises the practical application of the digital space, the internet and related innovative technology as new paradigms of knowledge to African literary engagement. But the absence of a critical standard remains a bane of this development. To address this critical imperative and further explore the prevalence of such works, I collected a dataset to find examples of literary trends and key recent examples of significant works, informed by Moretti's scholarship on distant reading. The dataset focuses on poetry written by younger South African authors from the Born Free Generation, in line with my broader research. The main purpose of this paper is to present my findings and the theoretical and methodological framework that informed them. The paper concludes by briefly proposing some possible means of expanding this research and proposing a large-scale online archival project.

Shedding Light on Loadshedding with Natural Language Processing: A social media case study on public perspectives of the South African electricity crisis in 2022

31 December 2025, 08:00

In times of collective discomfort and dissatisfaction, people often find solace in shared adversity on social media platforms like X (formerly known as Twitter). These platforms offer a unique window into the public’s emotions and viewpoints concerning common challenges. In 2022, South Africa experienced an electricity crisis, during which the country was subjected to rolling blackouts, commonly known as load-shedding, by Eskom, the country’s primary electricity provider, to prevent a national electricity grid shutdown. This study conducted a data-driven exploration of the public discourse surrounding Eskom and loadshedding on X using natural language processing and data science techniques. The dataset utilised for this study comprised tweets containing keywords related to Eskom and loadshedding. The study delved into the topics of discussion by applying topic modelling techniques to uncover latent themes within the discourse. The topics were analysed through a multifaceted lens to unpack and highlight patterns within the sentiments, emotions and biases that underpin conversations related to loadshedding and Eskom. A notable inclusion in the analysis was the incorporation of sarcasm classifications, which enhanced the interpretation of the emotion and sentiment within the topics discussed. The findings uncovered from the analysis were contrasted with loadshedding-related events in 2022 to understand the public discourse as the electricity crisis escalated. The methodology of this study provides a framework for utilising natural language processing techniques to uncover and examine the perspectives of a collective within discourse related to events of shared interest.