Large language models for history, philosophy, and sociology of science: Interpretive uses, methodological challenges, and critical perspectives

Stud Hist Philos Sci. 2026 Mar 30;117:102151. doi: 10.1016/j.shpsa.2026.102151. Online ahead of print.

ABSTRACT

This paper examines large language models (LLMs) as research tools in the history, philosophy, and sociology of science (HPSS). Because LLMs can work directly with heterogeneous, unstructured texts and capture meaning-relevant associations from usage patterns, they offer new ways to bridge close reading and corpus-scale analysis, challenging the idea that computational scale and interpretive nuance must trade off. We provide a compact primer on LLMs, covering the main components of their neural network architecture, the differences between generative and full-context models, and adaptation strategies such as fine-tuning, prompt-based learning, and retrieval-augmented generation (RAG). Building on this foundation, we analyze how LLMs recast three classic methodological problems in HPSS: working with historically messy data, detecting and interpreting large-scale patterns, and modeling scientific change over time. Across these areas we synthesize recent work in HPSS and adjacent fields, and we clarify how LLM outputs can function as exploratory prompts, as inputs to more structured pipelines, or as evidence under stricter validation and documentation. We conclude with four lessons: 1) model choice embeds interpretive trade-offs, 2) responsible use requires LLM literacy, 3) HPSS should develop its own tasks and evaluation practices, and 4) LLMs should extend rather than replace established interpretive methods. We also situate these methodological questions within broader concerns about platform dependence, accountability, and the responsibilities attached to research infrastructures. Finally, we argue that HPSS is well positioned both to use LLMs and to interrogate what counts as explanation, evidence, and responsible use in interpretive research.

PMID:41916166 | DOI:10.1016/j.shpsa.2026.102151

  •  

Migration, healthcare access, and the role of government schemes: Insights from South Indian trans women

Int J Transgend Health. 2025 Mar 15;27(2):902-916. doi: 10.1080/26895269.2025.2478092. eCollection 2026.

ABSTRACT

BACKGROUND: Research on the challenges, marginalization, and identity of trans women in India has sparked important discussions, contributing to progressive changes in society. While the visibility and recognition of trans women is steadily growing, beneficial schemes tailored to their unique challenges are often overlooked, underscoring the need for greater attention and action.

AIM: The article aims to identify the unique healthcare and migration challenges faced by South Indian trans women and the reach and utilization of government-provided facilities by their community.

METHOD: A survey of 53 and interviews with 4 South Indian trans women focused on the utilization of state and central government schemes. Data from public and private healthcare facilities in Madurai were collected and visualized using Airtable, with results disseminated in Tamil and English to ensure accessibility for the trans community.

RESULTS: A relationship was identified between the effectiveness of state government welfare schemes and the well-being of the trans women's community. However, central government schemes often fail to reach their entire population. Furthermore, state government transportation schemes do not sufficiently support their healthcare access and economic development.

DISCUSSION: To enhance the socio-economic development of trans women, policymakers can ensure that beneficial schemes comprehensively reach all segments of society. Increased promotion, awareness, and advancements in these schemes are necessary to meet the needs of the trans women community. Additionally, extending free bus fare facilities specifically to trans women is recommended to improve their healthcare access, mobility, economic opportunities, and integration into mainstream society.

PMID:41891076 | PMC:PMC13015021 | DOI:10.1080/26895269.2025.2478092

  •  

From Taixi Renshen Shuogai to Zhenjiu Dacheng: the transformation of bodily cognition in the evolution of medical illustration styles during the Ming and Qing dynasties

By Wenjia Yi

Med Humanit. 2026 Feb 12:medhum-2025-013580. doi: 10.1136/medhum-2025-013580. Online ahead of print.

ABSTRACT

Within the context of the eastward spread of Western learning during the late Ming and early Qing dynasties, the cross-cultural dissemination of medical knowledge exhibited complex characteristics of visual transformation. This study selects Taixi Renshen Shuogai (1623) and Zhenjiu Dacheng (1601) as research samples, employing a research method combining iconographic analysis and digital humanities to explore the different body concepts and their cultural connotations carried by Chinese and Western medical illustrations. The study constructs a three-dimensional analytical framework covering visual vocabulary, content expression and quantitative statistics to conduct an in-depth analysis of the cognitive differences in human body representation between the two medical traditions. The results show that the mechanistic view of the body advocated by Western anatomy and the organic view of the body upheld by traditional Chinese medicine form a sharp contrast at the image level; the former is characterised by precise analysis and structural reduction, while the latter centres on holistic grasp and functional correlation. This difference is not only reflected in visual elements such as composition patterns and expressive techniques, but more profoundly reflects the knowledge construction logic of different epistemological systems. The coexistence of two body cognition models during the Ming and Qing dynasties reveals the selective mechanism and creative transformation ability of Chinese culture in the process of knowledge acceptance, providing a new interpretive path for understanding the interaction model between traditional culture and foreign civilisations. This study expands the methodological boundaries of medical history research and provides a historical mirror for contemporary cross-cultural medical exchanges.

PMID:41679973 | DOI:10.1136/medhum-2025-013580

  •  

Generated cultural heritage question-answer dataset: Durga in multi-dimensional perspectives

Data Brief. 2026 Jan 20;65:112495. doi: 10.1016/j.dib.2026.112495. eCollection 2026 Apr.

ABSTRACT

This dataset presents a compilation of question-answer (QA) pairs derived from cultural texts and sources related to Durga mythology. It comprises a total of 21,395 QA pairs drawn from textual materials such as scriptures, ritual narratives, temple inscriptions, and traditional storytelling records. Each entry includes the source reference, question, and corresponding answer, provided in a structured format compatible with Excel for seamless integration into downstream natural language processing (NLP) tasks. Data collection involved manual curation and annotation by domain experts, followed by preprocessing steps including text normalization, duplication removal, and verification of factual and contextual accuracy. The dataset is designed to support generative QA models, culturally aware chatbots, and digital preservation of heritage knowledge. It is particularly valuable for research in AI-driven cultural applications, educational tools, and digital humanities initiatives aiming to bridge traditional knowledge with computational methods. Researchers and practitioners may utilize the dataset for training generative models, creating interactive educational platforms, developing culturally sensitive AI agents, and supporting comparative studies in cross-cultural heritage. This openly accessible resource adheres to ethical standards, with proper attribution to source materials, and provides a foundational asset for both academic research and applied development in culturally informed artificial intelligence.

PMID:41657412 | PMC:PMC12874138 | DOI:10.1016/j.dib.2026.112495

  •  

Visual Exploration of a Historical Vietnamese Corpus of Captioned Drawings: A Case Study

IEEE Comput Graph Appl. 2026 Feb 2;PP. doi: 10.1109/MCG.2026.3660122. Online ahead of print.

ABSTRACT

This paper presents a case study focusing on the exploratory visual analysis of a unique historical dataset consisting of approximately 4000 visual sketches and associated captions from an encyclopedic book published in 1909-1910. The book, which offers insight into Vietnamese crafts and social practices, poses the challenge of extracting cultural meaning and narrative structure from thousands of drawings and multilingual captions. Our research aims to explore and evaluate the effectiveness of multiple visualization techniques in uncovering meaningful relationships within the dataset while working closely with professional historians. The main contributions of this study include refining historical research questions through task and data abstraction, combining and validating visualization techniques for historical data interpretation, and involving a focus group of historians for further evaluation. These contributions offer generalizable insights for the development of domain-specific visualization tools and support interdisciplinary engagement in historical data visualization and critical digital humanities research.

PMID:41628052 | DOI:10.1109/MCG.2026.3660122

  •  

A georeferenced dataset of archaeobotanical findings of Olea europaea and Vitis vinifera compiled from published records from Central Italy

Data Brief. 2026 Jan 7;64:112443. doi: 10.1016/j.dib.2025.112443. eCollection 2026 Feb.

ABSTRACT

Here we present a coherent, georeferenced and chronologically qualified corpus of fossil plant remains compiled from published archaeobotanical records from archaeological sites in Central Italy, focused on Olea europaea (olive) and Vitis vinifera (grape). The dataset is entirely based on secondary data and does not include newly generated primary archaeobotanical analyses. The dataset integrates site, context and all relevant archaeobotanical occurrences within a coherent relational and spatial model. The corpus was initiated through a structured bibliographic survey aided by the BRAIN database. Exclusively published literature was consulted, allowing us to model archaeological sites and link them to excavation contexts and individual archaeobotanical occurrences (defined as the combination of a taxon and the specific plant part recovered, e.g., fruit, seed, rachis). The geodatabase was implemented using QGIS, with a local backend in GeoPackage, then migrated to PostgreSQL/PostGIS to support complex spatial/relational queries and future online outputs. All entities have a defined spatial placement accompanied by explicit quality-control parameters documenting positional uncertainty, source type and authority, as derived from the original published sources, ensuring transparent assessment of locational reliability. To enrich taxonomic information, an automated open thesaurus was built from CC BY/CC BY-SA resources (Floritaly, Acta Plantarum, and Wikimedia projects). The workflow employs REST-style access (or form-equivalent submissions), conservative rate-limiting, randomized waits, retries, and checkpoints; provenance and attribution (including noted transformations) are preserved. A standardized chronological table harmonizes relative cultural phases using ICCD nomenclature, with controlled fallbacks to Perio.do or peer-reviewed literature; a self-referential hierarchy (parent_id) ensures inheritance from sub-phase to broader period.
Crucially, the use of open licenses, stable identifiers and cross-references makes the dataset interoperable and interlinked with the source ecosystems from which the secondary archaeobotanical data were extracted: records can resolve back to Floritaly and Acta Plantarum, and our forthcoming web portal can expose these connections for bidirectional navigation, automated updating and external reuse. The result is an interoperable, verifiable resource suitable for spatial and temporal analyses of plant remains based on aggregated and standardized published archaeobotanical data, while remaining legally reusable under the original licenses.
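
The self-referential hierarchy (parent_id) mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the dataset's actual schema: the table contents, the `label` field, and the function name `resolve_period_chain` are hypothetical, and only the `parent_id` column name comes from the abstract. The sketch shows how a sub-phase can inherit its broader periods by walking parent links.

```python
def resolve_period_chain(phase_id, phases):
    """Walk parent_id links from a phase up to its broadest period.

    `phases` maps an id to a row with a (hypothetical) "label" and a
    "parent_id" that is None for top-level periods.
    """
    chain = []
    seen = set()
    current = phase_id
    while current is not None:
        if current in seen:
            # Guard against a malformed table with circular parents.
            raise ValueError(f"cycle detected at phase id {current}")
        seen.add(current)
        row = phases[current]
        chain.append(row["label"])
        current = row["parent_id"]
    return chain


# Toy rows for illustration only; not taken from the published dataset.
toy_phases = {
    1: {"label": "broad period", "parent_id": None},
    2: {"label": "phase", "parent_id": 1},
    3: {"label": "sub-phase", "parent_id": 2},
}

print(resolve_period_chain(3, toy_phases))
```

In a relational backend such as the PostgreSQL/PostGIS one the abstract names, the same walk would more likely be written as a recursive query; the Python version is only meant to make the inheritance logic concrete.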

PMID:41624435 | PMC:PMC12855569 | DOI:10.1016/j.dib.2025.112443

  •  

Application of deep learning for transformation of Chinese traditional cultural narrative patterns and enhancement of cultural identity empowered by AIGC

Sci Rep. 2025 Dec 24;16(1):2505. doi: 10.1038/s41598-025-32302-5.

ABSTRACT

This study aims to achieve controllable generation of Chinese traditional cultural narrative content and enhance cultural identity. First, it constructs a tri-modal generation framework of text-image-style based on Stable Diffusion v2.1 and Contrastive Language-Image Pretraining (CLIP) models, realizing the joint modeling of traditional cultural semantics and visual imagery. Second, the study introduces the Low-Rank Adaptation (LoRA) mechanism to embed traditional cultural style features in a lightweight manner, improving the model's style adaptability under small sample conditions. Finally, a three-level evaluation system of "generation quality-semantic consistency-cultural identity" is built, covering both objective indicators and user feedback, to systematically verify the model's performance. Results show that the proposed model significantly outperforms existing methods in multiple dimensions: in terms of image quality, the Fréchet Inception Distance (FID) is 22.85, the Learned Perceptual Image Patch Similarity (LPIPS) is 0.298, and the style recognition accuracy reaches 86.4%. Regarding narrative consistency, the Bilingual Evaluation Understudy (BLEU) score is 0.325, the CLIP text-image similarity is 0.793, and the Narrative Style Match is 82.3%. On the cultural perception level, the average user narrative resonance is 4.32 points, the imagery accuracy score is 0.748, and the question-answer task pass rate is 82.6%. The comparative results indicate that the proposed method has significant advantages in expressive diversity and depth of cultural communication. When properly designed, Artificial Intelligence Generated Content (AIGC) technology can be effectively used for the generation and identity reconstruction of Chinese traditional cultural narrative content. This study provides a scalable technical path for the integration of AI and traditional culture, and expands the boundaries of digital humanities in content generation and reception research.

PMID:41444384 | PMC:PMC12820238 | DOI:10.1038/s41598-025-32302-5

  •  

Attack on Titan (AoT): Anime image dataset for character, scene, emotion recognition and beyond

Data Brief. 2025 Nov 8;63:112246. doi: 10.1016/j.dib.2025.112246. eCollection 2025 Dec.

ABSTRACT

Anime is an influential medium with global popularity, combining visual aesthetics with narrative depth and offering potential applications in content analysis, style transfer, and emotion recognition within computer vision research. Despite its widespread appeal, publicly available anime character datasets remain scarce. To address this gap, we propose the Attack on Titan: Anime Image Dataset, derived from the popular series Attack on Titan, to support anime-focused computer vision research. The dataset comprises 4041 high-quality images divided into 14 classes, each representing a prominent character from the series. These images are manually collected through high-resolution screenshots, capturing a wide range of character poses, expressions, costumes, and backgrounds. The dataset is suitable for various computer vision tasks, including character recognition, emotion detection, style classification, and domain adaptation.

PMID:41399437 | PMC:PMC12702017 | DOI:10.1016/j.dib.2025.112246

  •  

An interoperable catalogue of Middle and Late Bronze Age settlements in western Anatolia (c. 2000-1200 BCE)

Sci Data. 2025 Dec 1;12(1):1804. doi: 10.1038/s41597-025-06241-9.

ABSTRACT

This dataset offers a comprehensive digital catalogue of 483 archaeological settlement sites in western Anatolia dating to the Middle and Late Bronze Age (c. 2000-1200 BCE). Compiled over a decade, it brings together evidence from excavation reports, systematic surveys, historical sources, and remote sensing. Each site is georeferenced and described through a standardized set of metadata, including chronological attribution, site function, material culture, bibliographic references, and associated ancient mineral resources. The dataset is published on Zenodo as a collection of openly accessible files, structured with consistent keys that ensure integration across records. To enhance semantic interoperability, settlement entries are linked to external reference datasets such as open knowledge bases, enabling opportunities for comparative, geospatial, and interdisciplinary research spanning archaeology, digital humanities, and historical geography. By combining standardized metadata with semantic linking, the resource facilitates reuse within broader digital infrastructures. It thereby provides a transparent, openly licensed foundation for analyzing regional settlement systems and encourages more comprehensive approaches to the study of Bronze Age Anatolia.

PMID:41326413 | PMC:PMC12669694 | DOI:10.1038/s41597-025-06241-9

  •  

An AI-driven tools assessment framework for English teachers using the Fuzzy Delphi algorithm and deep learning

By Min Yu

Sci Rep. 2025 Nov 24;15(1):41531. doi: 10.1038/s41598-025-25466-7.

ABSTRACT

English literature and linguistics have long served as foundational disciplines in humanities education, cultivating critical analysis, linguistic proficiency, and cultural interpretation. Conventional teaching methods struggle to meet diverse learner needs, ensure consistent engagement, and provide personalized academic feedback. To improve learning with the help of modern techniques, this study proposes a comprehensive, multi-technique Artificial Intelligence (AI)-driven tools assessment framework aimed at enhancing English pedagogy through the integration of advanced artificial intelligence tools. The research adopts a mixed-methods design incorporating classroom case studies, in-depth interviews, and analysis of students' documents to evaluate their learning. The framework employed statistical techniques to validate significant relationships among engagement, tool usage, and learning clarity. Key evaluation criteria are captured using the Fuzzy Delphi Technique, which identifies high-importance attributes such as AI usage, usability, and analytical quality. Moreover, eXplainable AI (XAI) techniques including LIME and SHAP are applied to enhance model transparency, offering both global and local interpretability of outcomes. To predict pedagogical effectiveness, a deep learning Bi-LSTM model was trained, achieving 90% accuracy, 92% precision, 93% recall, and 92% F1-score across key performance metrics for the usage analysis of AI-based tools.

PMID:41286088 | PMC:PMC12644666 | DOI:10.1038/s41598-025-25466-7

  •  

Interactive learning system neural network algorithm optimization

Sci Rep. 2025 Oct 10;15(1):35498. doi: 10.1038/s41598-025-19436-2.

ABSTRACT

With the development of artificial intelligence education, the human-computer interaction and human-human interaction in virtual learning communities such as Zhihu and Quora have become research hotspots. This study has optimized the research dimensions of the virtual learning system in colleges and universities based on neural network algorithms and the value of digital intelligence in the humanities. This study aims to improve the efficiency and interactive quality of students' online learning by optimizing the interactive system of virtual learning communities in colleges. We constructed a long short-term memory (LSTM) network model based on the concept of digital humanities integration. The model uses an attention mechanism to improve its ability to comprehend and process question-and-answer (Q&A) content. In addition, student satisfaction with its use was investigated. The Siamese LSTM model with the attention mechanism outperforms other methods when using Word2Vec for embedding and Manhattan distance as a similarity function. The performance of the Siamese LSTM model with the introduction of the attention mechanism improves by 9%. In the evaluation of duplicate question detection on the Quora dataset, our model outperformed the previously established high-performing models, achieving an accuracy of 91.6%. Students expressed greater satisfaction with the updated interactive platform. The model in this study is more suitable than other published models for processing the SemEval Task 1 dataset. Our Q&A system, which implements simple information extraction and a natural language understanding method to answer questions, is highly rated by students.
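
As a rough illustration of what "Manhattan distance as a similarity function" can mean in a Siamese setup, the sketch below scores two already-encoded question vectors. It assumes the common formulation in which similarity is exp(-L1 distance), which the abstract does not confirm; the function names, the decision threshold, and the toy vectors are hypothetical, and the LSTM encoder, attention mechanism, and Word2Vec embeddings are deliberately left out.

```python
import math


def manhattan_similarity(h_left, h_right):
    """Map the L1 distance between two encodings to a score in (0, 1].

    Assumes similarity = exp(-||h_left - h_right||_1); identical
    encodings score 1.0 and the score decays as they move apart.
    """
    if len(h_left) != len(h_right):
        raise ValueError("encodings must have the same length")
    l1_distance = sum(abs(a - b) for a, b in zip(h_left, h_right))
    return math.exp(-l1_distance)


def is_duplicate(h_left, h_right, threshold=0.5):
    """Flag a question pair as duplicate above a (hypothetical) threshold."""
    return manhattan_similarity(h_left, h_right) >= threshold


# Toy encodings standing in for the final hidden states of the two branches.
q1 = [0.20, -0.10, 0.40]
q2 = [0.25, -0.10, 0.35]
print(manhattan_similarity(q1, q2), is_duplicate(q1, q2))
```

Because both branches of a Siamese network share weights, the two encodings live in the same space, which is what makes a simple distance-based score like this usable as the output layer.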

PMID:41073596 | PMC:PMC12514068 | DOI:10.1038/s41598-025-19436-2

  •  

A libraries reproducibility hackathon: connecting students to university research and testing the longevity of published code

F1000Res. 2025 Sep 9;13:1305. doi: 10.12688/f1000research.156917.2. eCollection 2024.

ABSTRACT

BACKGROUND: Reproducibility is a basis of scientific integrity, yet it remains a significant challenge across disciplines in computational science. This reproducibility crisis is now being met with an Open Science movement, which has risen to prominence within the scientific community and academic libraries especially. At the Carnegie Mellon University Libraries, the Open Science and Data Collaborations (OSDC) Program promotes Open Science practices with resources, services, and events. Hosting hackathons in academic libraries may show promise for furthering such efforts.

METHODS: To address the need for reproducible computational research and promote Open Science within the community, members of the OSDC Program organized a single-day hackathon centered around reproducibility. Partnering with a faculty researcher in English and Digital Humanities, we invited community members to reuse Python code and data from a research publication deposited to Harvard Dataverse. We also published these materials as a compute capsule in Code Ocean that participants could also access. Additionally, we investigated ways to use ChatGPT to troubleshoot errors from rerunning this code.

RESULTS: Three students from the School of Computer Science participated in this hackathon. Accessing materials from Harvard Dataverse, these students found success reproducing most of the data visualizations, but they required some manual setup and modifications to address deprecated libraries used in the code. Alternatively, we found Code Ocean to be a highly accessible option, free from deprecation risk. Last, ChatGPT also aided in finding and addressing the same roadblocks to successfully reproduce the same figures as the participating students.

CONCLUSIONS: This hackathon allowed several students an opportunity to interact with and evaluate real research outputs, testing the reproducibility of computational data analyses. Partnering with faculty opened opportunities to improve open research materials. This case study outlines one approach for other academic libraries to highlight challenges that face reproducibility in an interactive setting.

PMID:41064702 | PMC:PMC12501581 | DOI:10.12688/f1000research.156917.2

  •  

Personal memory and distant reading can complement each other: a reply to Gillon

J Med Ethics. 2025 Sep 4:jme-2025-111310. doi: 10.1136/jme-2025-111310. Online ahead of print.

ABSTRACT

We respond to Gillon's critique of our data-driven analysis of the history of Journal of Medical Ethics (JME), in which we used a topic model to trace intellectual trends in the journal's first 50 years. Gillon, drawing on his personal memories as JME's second (and longest serving) editor, challenges several of our findings, particularly those concerning the prominence and classification of topics such as Ethics education. In this reply, we clarify misunderstandings that led to some of his criticisms of our method. At the same time, we also briefly discuss some nuances of topic modelling, in particular, its reliance on simplified representations of text, sensitivity to modeling choices and topic interpretations. Rather than viewing computational models and editorial memory as competing sources of insight, we propose that they are complementary: each illuminates different dimensions of the journal's evolution. Gillon's engagement with our work ultimately highlights the importance of methodological transparency and the value of combining digital humanities tools with lived experience in the historiography of academic disciplines.

PMID:40908135 | DOI:10.1136/jme-2025-111310

  •  

Link to link: The 'Lesbian Eroticverse' of personal narratives and Indian Digital Platforms

J Lesbian Stud. 2025 Sep 4:1-17. doi: 10.1080/10894160.2025.2555666. Online ahead of print.

ABSTRACT

Following the 2000s, India witnessed an emergence of cyber-queer spaces in the form of several online multimedia platforms, social media networking and dating sites which explored the intersections of gender and sexuality. Through an analysis of the queer 'I' in digital personal narratives by queer AFAB individuals on multimedia platforms like Agents of Ishq (AOI) and Point of View (POV), we argue that both the design of these websites and this written personal history contain a self-affirming erotic power. This power is enmeshed with and informed by an Indian lesbian political history which demands that we rethink our relationship with the challenges of lesbian desire, intersectionality, caste subalternity and the politics of inclusion both online and offline.

PMID:40905382 | DOI:10.1080/10894160.2025.2555666

  •  

Assessing advanced handwritten text recognition engines for digitizing historical documents

Int J Digit Humanit. 2025;7(1):115-134. doi: 10.1007/s42803-025-00100-0. Epub 2025 May 12.

ABSTRACT

This study provides critical insights and evaluates the performance of state-of-the-art Handwritten Text Recognition (HTR) engines (PyLaia, HTR+, IDA, TrOCR-f, and Transkribus' proprietary Transformer-based "supermodel" Titan) for digitizing historical documents. Using a diverse range of datasets that include different scripts, this research assesses each engine's accuracy and efficiency in handling multilingual content, complex styles, abbreviations, and historical orthography. Results indicate that, while all engines can be trained or fine-tuned to improve performance, Titan and TrOCR-f exhibit superior out-of-the-box capabilities for Latin-script documents. PyLaia, IDA, and HTR+ excel in specific non-Latin scripts when specifically trained or fine-tuned. This study underscores the importance of training, fine-tuning, and integrating language models, providing critical insights for future advancements in HTR technology and its application in the digital humanities.

PMID:40584138 | PMC:PMC12202554 | DOI:10.1007/s42803-025-00100-0

  •  

Forging an interdisciplinary lens for understanding community digital archives of South Asian diaspora

Front Sociol. 2025 Jun 5;10:1450641. doi: 10.3389/fsoc.2025.1450641. eCollection 2025.

ABSTRACT

Different communities have begun archiving their own experiences and histories as a way to reclaim narratives and contend with their own identities and belonging. As the types of archives diversify and the role of digital technologies in archival practices expands, we are increasingly seeing digital community archival efforts. While archives have been key for carrying out research in the social sciences and the humanities, and are periodically found as topics of study in disciplinary subfields concerning themselves with the digital, there is little research on the specific subject of community digital archives. In this essay, I argue that community digital archives are important objects of sociological and historical inquiry. I discuss two community digital archives of the South Asian diaspora - the South Asian American Digital Archive (SAADA) and 1947 Partition Archive. I show that they offer insights into migration histories and notions of belonging and identity of South Asian diaspora not only through the digital records they produce, but also through how they operate using digital connectivity. I demonstrate that an interdisciplinary lens is key for critically engaging with these archives.

PMID:40538462 | PMC:PMC12177577 | DOI:10.3389/fsoc.2025.1450641

  •  

Biodigital Philosophy, Technological Convergence, and Postdigital Knowledge Ecologies

Postdigit Sci Educ. 2021;3(2):370-388. doi: 10.1007/s42438-020-00211-7. Epub 2021 Jan 11.

ABSTRACT

New technological ability is leading postdigital science, where biology as digital information, and digital information as biology, are now dialectically interconnected. In this article we firstly explore a philosophy of biodigitalism as a new paradigm closely linked to bioinformationalism. Both involve the mutual interaction and integration of information and biology, which leads us into discussion of biodigital convergence. As a unified ecosystem, this allows us to resolve problems that isolated disciplinary capabilities cannot, creating new knowledge ecologies within a constellation of technoscience. To illustrate our arrival at this historical flash point via several major epistemological shifts in the post-war period, we venture a tentative typology. The convergence between biology and information reconfigures all levels of theory and practice, and even critical reason itself now requires a biodigital interpretation oriented towards ecosystems and coordinated Earth systems. In this understanding, neither the digital humanities, the biohumanities, nor the posthumanities sit outside of biodigitalism. Instead, posthumanism is but one form of biodigitalism that mediates the biohumanities and the digital humanities, no longer preoccupied with the tradition of the subject, but with the constellation of forces shaping the future of human ontologies. This heralds a new biopolitics which brings the philosophy of race, class, gender, and intelligence, into a compelling dialog with genomics and information.

PMID:40477145 | PMC:PMC7797699 | DOI:10.1007/s42438-020-00211-7

  •  

Diversity statistics of onomastic data reveal social patterns in Hebrew Kingdoms of the Iron Age

Proc Natl Acad Sci U S A. 2025 May 20;122(20):e2503850122. doi: 10.1073/pnas.2503850122. Epub 2025 May 14.

ABSTRACT

The distribution of personal names provides unique, yet often overlooked, insight into modern and historical societies. This study employs diversity statistics, commonly used in ecology, to analyze onomastic data from Iron Age II archaeological excavations in the Southern Levant (950-586 BCE). Our findings reveal higher onomastic diversity in the Kingdom of Israel compared to Judah, suggesting a more cosmopolitan society. We also observe a decrease in name diversity in Judah over time, potentially reflecting sociopolitical changes. Center/periphery analysis shows contrasting patterns in Israel and Judah. These results provide insights into social dynamics, cultural interactions, and identity formation in these ancient societies. Our methodology, validated using supplementary archaeological data, as well as modern datasets, offers a robust framework for applying diversity statistics across various modern and historical contexts.
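
To make the idea of ecological diversity statistics applied to names more concrete, here is a minimal Python sketch. The abstract does not say which indices the authors used, so the Shannon entropy and Gini-Simpson index below are assumptions chosen because they are standard in ecology. The function names and the letter-coded toy samples are hypothetical and carry no historical data.

```python
import math
from collections import Counter


def shannon_entropy(names):
    """Shannon entropy (natural log) of the name frequency distribution."""
    counts = Counter(names)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def gini_simpson(names):
    """Probability that two draws (with replacement) give different names."""
    counts = Counter(names)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Toy samples: letters stand in for distinct personal names.
sample_varied = ["A", "B", "C", "D", "E", "A"]
sample_concentrated = ["A", "A", "A", "B", "B", "C"]

for label, sample in [("varied", sample_varied),
                      ("concentrated", sample_concentrated)]:
    print(label, round(shannon_entropy(sample), 3),
          round(gini_simpson(sample), 3))
```

Both indices come out higher for the varied sample than for the concentrated one, which is the kind of contrast the abstract reports between the two kingdoms. Real comparisons would also need to account for unequal sample sizes, which this sketch ignores.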

PMID:40366687 | PMC:PMC12107089 | DOI:10.1073/pnas.2503850122

  •  

CORHOH: Text corpus of holocaust oral histories

Data Brief. 2025 Feb 24;59:111426. doi: 10.1016/j.dib.2025.111426. eCollection 2025 Apr.

ABSTRACT

This paper outlines the compilation and annotation process of CORHOH: Text CORpus of Holocaust Oral Histories. The corpus consists of 500 oral histories, each narrative from one survivor. The transcripts of the oral histories are retrieved from the Let Them Speak Project [1]. The transcripts are normalized and further annotated. The corpus offers rich metadata about both the testimony givers and the interviews. All technical content is removed, and a unique identifier is assigned to each question (posed by the interviewer) and answer (provided by the survivor). The corpus complies with the TEI guidelines [2]. The corpus includes 106,519 questions and 107,125 answers, making it easy to distinguish between the utterances that belong to the Holocaust survivor or anyone else who is involved in the interview, primarily the interviewer. CORHOH is particularly suited for studies on trauma expression and psychological concepts embedded in survivors' narratives. Additionally, it offers potential for data mining to uncover patterns (e.g., migration trends) and supports natural language processing techniques, such as topic modelling, sentiment analysis, and named entity recognition. The CORHOH data is courtesy of the United States Holocaust Memorial Museum (USHMM) and is publicly available under the CC BY-NC-SA 4.0 license.

PMID:40124297 | PMC:PMC11927712 | DOI:10.1016/j.dib.2025.111426

  •