
Visual Exploration of a Historical Vietnamese Corpus of Captioned Drawings: A Case Study

IEEE Comput Graph Appl. 2026 Feb 2;PP. doi: 10.1109/MCG.2026.3660122. Online ahead of print.

ABSTRACT

This paper presents a case study focusing on the exploratory visual analysis of a unique historical dataset consisting of approximately 4000 visual sketches and associated captions from an encyclopedic book published in 1909-1910. The book, which offers insight into Vietnamese crafts and social practices, poses the challenge of extracting cultural meaning and narrative structure from thousands of drawings and multilingual captions. Our research aims to explore and evaluate the effectiveness of multiple visualization techniques in uncovering meaningful relationships within the dataset while working closely with professional historians. The main contributions of this study include refining historical research questions through task and data abstraction, combining and validating visualization techniques for historical data interpretation, and involving a focus group of historians for further evaluation. These contributions offer generalizable insights for the development of domain-specific visualization tools and support interdisciplinary engagement in historical data visualization and critical digital humanities research.

PMID:41628052 | DOI:10.1109/MCG.2026.3660122

  •  

A georeferenced dataset of archaeobotanical findings of Olea europaea and Vitis vinifera compiled from published records from Central Italy

Data Brief. 2026 Jan 7;64:112443. doi: 10.1016/j.dib.2025.112443. eCollection 2026 Feb.

ABSTRACT

Here we present a coherent, georeferenced and chronologically qualified corpus of fossil plant remains compiled from published archaeobotanical records from archaeological sites in Central Italy, focused on Olea europaea (olive) and Vitis vinifera (grape). The dataset is entirely based on secondary data and does not include newly generated primary archaeobotanical analyses. The dataset integrates site, context and all relevant archaeobotanical occurrences within a coherent relational and spatial model. The corpus was initiated through a structured bibliographic survey aided by the BRAIN database. Only published literature was consulted, which allowed us to model archaeological sites and link them to excavation contexts and individual archaeobotanical occurrences (defined as the combination of a taxon and the specific plant part recovered, e.g., fruit, seed, rachis). The geodatabase was implemented using QGIS, with a local backend in GeoPackage, then migrated to PostgreSQL/PostGIS to support complex spatial/relational queries and future online outputs. All entities have a defined spatial placement accompanied by explicit quality-control parameters documenting positional uncertainty, source type and authority, as derived from the original published sources, ensuring transparent assessment of locational reliability. To enrich taxonomic information, an automated open thesaurus was built from CC BY/CC BY-SA resources (Floritaly, Acta Plantarum, and Wikimedia projects). The workflow employs REST-style access (or form-equivalent submissions), conservative rate-limiting, randomized waits, retries, and checkpoints; provenance and attribution (including noted transformations) are preserved. A standardized chronological table harmonizes relative cultural phases using ICCD nomenclature, with controlled fallbacks to Perio.do or peer-reviewed literature; a self-referential hierarchy (parent_id) ensures inheritance from sub-phase to broader period.
Crucially, the use of open licenses, stable identifiers and cross-references makes the dataset interoperable and interlinked with the source ecosystems from which the secondary archaeobotanical data were extracted: records can resolve back to Floritaly and Acta Plantarum, and our forthcoming web portal can expose these connections for bidirectional navigation, automated updating and external reuse. The result is an interoperable, verifiable resource suitable for spatial and temporal analyses of plant remains based on aggregated and standardized published archaeobotanical data, while remaining legally reusable under the original licenses.
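The self-referential parent_id hierarchy mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' schema: the table contents, the field names other than parent_id, and the reading of "inheritance" as a sub-phase falling back to the values of its broader period are all hypothetical.

```python
from typing import Optional

# Hypothetical chronological table. Only `parent_id` is named in the abstract;
# the labels, dates, and other field names are illustrative placeholders.
PHASES = {
    1: {"label": "Period A", "parent_id": None, "start": -900, "end": -700},
    2: {"label": "Sub-period A1", "parent_id": 1, "start": None, "end": -800},
    3: {"label": "Sub-phase A1a", "parent_id": 2, "start": None, "end": None},
}


def resolve(phase_id: int, field: str, table: dict = PHASES) -> Optional[int]:
    """Return `field` for a phase, walking up parent_id links until a value is found."""
    seen = set()
    current = phase_id
    while current is not None:
        if current in seen:
            # A self-referential table can contain cycles by mistake; fail loudly.
            raise ValueError("cycle detected in parent_id hierarchy")
        seen.add(current)
        row = table[current]
        if row[field] is not None:
            return row[field]
        current = row["parent_id"]
    return None  # no ancestor defines this field


if __name__ == "__main__":
    # The sub-phase has no dates of its own, so it inherits from its ancestors.
    print(resolve(3, "start"), resolve(3, "end"))
```

In a PostgreSQL backend the same walk would typically be written as a recursive query; the Python version only illustrates the lookup logic.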

PMID:41624435 | PMC:PMC12855569 | DOI:10.1016/j.dib.2025.112443

  •  

Application of deep learning for transformation of Chinese traditional cultural narrative patterns and enhancement of cultural identity empowered by AIGC

Sci Rep. 2025 Dec 24;16(1):2505. doi: 10.1038/s41598-025-32302-5.

ABSTRACT

This study aims to achieve controllable generation of Chinese traditional cultural narrative content and enhance cultural identity. First, it constructs a tri-modal generation framework of text-image-style based on Stable Diffusion v2.1 and Contrastive Language-Image Pretraining (CLIP) models, realizing the joint modeling of traditional cultural semantics and visual imagery. Second, the study introduces the Low-Rank Adaptation (LoRA) mechanism to embed traditional cultural style features in a lightweight manner, improving the model's style adaptability under small sample conditions. Finally, a three-level evaluation system of "generation quality-semantic consistency-cultural identity" is built, covering both objective indicators and user feedback, to systematically verify the model's performance. Results show that the proposed model significantly outperforms existing methods in multiple dimensions: in terms of image quality, the Fréchet Inception Distance (FID) is 22.85, the Learned Perceptual Image Patch Similarity (LPIPS) is 0.298, and the style recognition accuracy reaches 86.4%. Regarding narrative consistency, the Bilingual Evaluation Understudy (BLEU) score is 0.325, the CLIP text-image similarity is 0.793, and the Narrative Style Match is 82.3%. On the cultural perception level, the average user narrative resonance is 4.32 points, the imagery accuracy score is 0.748, and the question-answer task pass rate is 82.6%. The comparative results indicate that the proposed method has significant advantages in expressive diversity and depth of cultural communication. When properly designed, Artificial Intelligence Generated Content (AIGC) technology can be effectively used for the generation and identity reconstruction of Chinese traditional cultural narrative content. This study provides a scalable technical path for the integration of AI and traditional culture, and expands the boundaries of digital humanities in content generation and reception research.

PMID:41444384 | PMC:PMC12820238 | DOI:10.1038/s41598-025-32302-5

  •  

Attack on Titan (AoT): Anime image dataset for character, scene, emotion recognition and beyond

Data Brief. 2025 Nov 8;63:112246. doi: 10.1016/j.dib.2025.112246. eCollection 2025 Dec.

ABSTRACT

Anime is an influential medium with global popularity, combining visual aesthetics with narrative depth and offering potential applications in content analysis, style transfer, and emotion recognition within computer vision research. Despite its widespread appeal, publicly available anime character datasets remain scarce. To address this gap, we propose the Attack on Titan: Anime Image Dataset, derived from the popular series Attack on Titan, to support anime-focused computer vision research. The dataset comprises 4041 high-quality images divided into 14 classes, each representing a prominent character from the series. These images are manually collected through high-resolution screenshots, capturing a wide range of character poses, expressions, costumes, and backgrounds. The dataset is suitable for various computer vision tasks, including character recognition, emotion detection, style classification, and domain adaptation.

PMID:41399437 | PMC:PMC12702017 | DOI:10.1016/j.dib.2025.112246

  •  

An interoperable catalogue of Middle and Late Bronze Age settlements in western Anatolia (c. 2000-1200 BCE)

Sci Data. 2025 Dec 1;12(1):1804. doi: 10.1038/s41597-025-06241-9.

ABSTRACT

This dataset offers a comprehensive digital catalogue of 483 archaeological settlement sites in western Anatolia dating to the Middle and Late Bronze Age (c. 2000-1200 BCE). Compiled over a decade, it brings together evidence from excavation reports, systematic surveys, historical sources, and remote sensing. Each site is georeferenced and described through a standardized set of metadata, including chronological attribution, site function, material culture, bibliographic references, and associated ancient mineral resources. The dataset is published on Zenodo as a collection of openly accessible files, structured with consistent keys that ensure integration across records. To enhance semantic interoperability, settlement entries are linked to external reference datasets such as open knowledge bases, enabling opportunities for comparative, geospatial, and interdisciplinary research spanning archaeology, digital humanities, and historical geography. By combining standardized metadata with semantic linking, the resource facilitates reuse within broader digital infrastructures. It thereby provides a transparent, openly licensed foundation for analyzing regional settlement systems and encourages more comprehensive approaches to the study of Bronze Age Anatolia.

PMID:41326413 | PMC:PMC12669694 | DOI:10.1038/s41597-025-06241-9

  •  

An AI-driven tools assessment framework for english teachers using the Fuzzy Delphi algorithm and deep learning

Author: Min Yu

Sci Rep. 2025 Nov 24;15(1):41531. doi: 10.1038/s41598-025-25466-7.

ABSTRACT

English literature and linguistics have long served as foundational disciplines in humanities education, cultivating critical analysis, linguistic proficiency, and cultural interpretation. Conventional teaching methods struggle to meet diverse learner needs, ensure consistent engagement, and provide personalized academic feedback. To improve learning with the help of modern techniques, this study proposes a comprehensive, multi-technique Artificial Intelligence (AI)-driven tools assessment framework aimed at enhancing English pedagogy through the integration of advanced artificial intelligence tools. The research adopts a mixed-methods design incorporating classroom case studies, in-depth interviews, and analysis of students' documents to evaluate their learning. The framework employed statistical techniques to validate significant relationships among engagement, tool usage, and learning clarity. Key evaluation criteria are captured using the Fuzzy Delphi Technique, which identifies high-importance attributes such as AI usage, usability, and analytical quality. Moreover, eXplainable AI (XAI) techniques, including LIME and SHAP, were applied to enhance model transparency, offering both global and local interpretability of outcomes. To predict pedagogical effectiveness, a deep learning Bi-LSTM model was trained, achieving 90% accuracy, 92% precision, 93% recall, and 92% F1-score across key performance metrics for the usage analysis of AI-based tools.
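As a quick consistency check on the reported metrics, the F1-score can be recomputed as the harmonic mean of precision and recall. This is only a sketch: it assumes the reported figures are single, directly comparable values, which the abstract does not state (averaged per-class scores need not satisfy this relation exactly).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # conventional value when both metrics are zero
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values taken from the abstract: 92% precision, 93% recall.
    f1 = f1_score(0.92, 0.93)
    print(f"{f1:.1%}")  # prints 92.5%
```

Under that assumption, the recomputed value of about 0.925 is consistent with the reported 92% F1-score.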

PMID:41286088 | PMC:PMC12644666 | DOI:10.1038/s41598-025-25466-7

  •  

Interactive learning system neural network algorithm optimization

Sci Rep. 2025 Oct 10;15(1):35498. doi: 10.1038/s41598-025-19436-2.

ABSTRACT

With the development of artificial intelligence education, the human-computer interaction and human-human interaction in virtual learning communities such as Zhihu and Quora have become research hotspots. This study has optimized the research dimensions of the virtual learning system in colleges and universities based on neural network algorithms and the value of digital intelligence in the humanities. This study aims to improve the efficiency and interactive quality of students' online learning by optimizing the interactive system of virtual learning communities in colleges. We constructed a long short-term memory (LSTM) network model based on the concept of digital humanities integration. The model uses an attention mechanism to improve its ability to comprehend and process question-and-answer (Q&A) content. In addition, student satisfaction with its use was investigated. The Siamese LSTM model with the attention mechanism outperforms other methods when using Word2Vec for embedding and Manhattan distance as a similarity function. The performance of the Siamese LSTM model with the introduction of the attention mechanism improves by 9%. In the evaluation of duplicate question detection on the Quora dataset, our model outperformed the previously established high-performing models, achieving an accuracy of 91.6%. Students expressed greater satisfaction with the updated interactive platform. The model in this study is more suitable than other published models for processing the SemEval Task 1 dataset. Our Q&A system, which implements simple information extraction and a natural language understanding method to answer questions, is highly rated by students.
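The use of Manhattan distance as a similarity function can be illustrated with a short sketch. The abstract does not give the formula, so this assumes one common formulation for Siamese networks, in which the L1 distance between the two encoded vectors is mapped into (0, 1] with a negative exponential. The function name and the example vectors are hypothetical.

```python
import math
from typing import Sequence


def manhattan_similarity(h_a: Sequence[float], h_b: Sequence[float]) -> float:
    """Map the L1 (Manhattan) distance between two vectors to a similarity in (0, 1]."""
    if len(h_a) != len(h_b):
        raise ValueError("vectors must have the same length")
    l1_distance = sum(abs(a - b) for a, b in zip(h_a, h_b))
    # Assumed formulation: exp(-distance); identical vectors score exactly 1.0.
    return math.exp(-l1_distance)


if __name__ == "__main__":
    # Stand-ins for the final hidden states of the two LSTM branches.
    question_a = [0.2, -0.1, 0.7]
    question_b = [0.1, -0.1, 0.9]
    print(manhattan_similarity(question_a, question_b))
```

A score near 1 would indicate near-duplicate questions; the decision threshold would be tuned on labelled question pairs.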

PMID:41073596 | PMC:PMC12514068 | DOI:10.1038/s41598-025-19436-2

  •  

A libraries reproducibility hackathon: connecting students to university research and testing the longevity of published code

F1000Res. 2025 Sep 9;13:1305. doi: 10.12688/f1000research.156917.2. eCollection 2024.

ABSTRACT

BACKGROUND: Reproducibility is a basis of scientific integrity, yet it remains a significant challenge across disciplines in computational science. This reproducibility crisis is now being met with an Open Science movement, which has risen to prominence within the scientific community and academic libraries especially. At the Carnegie Mellon University Libraries, the Open Science and Data Collaborations (OSDC) Program promotes Open Science practices with resources, services, and events. Hosting hackathons in academic libraries may show promise for furthering such efforts.

METHODS: To address the need for reproducible computational research and promote Open Science within the community, members of the OSDC Program organized a single-day hackathon centered around reproducibility. Partnering with a faculty researcher in English and Digital Humanities, we invited community members to reuse Python code and data from a research publication deposited to Harvard Dataverse. We also published these materials as a compute capsule in Code Ocean that participants could also access. Additionally, we investigated ways to use ChatGPT to troubleshoot errors from rerunning this code.

RESULTS: Three students from the School of Computer Science participated in this hackathon. Accessing materials from Harvard Dataverse, these students found success reproducing most of the data visualizations, but they required some manual setup and modifications to address deprecated libraries used in the code. Alternatively, we found Code Ocean to be a highly accessible option, free from deprecation risk. Last, ChatGPT also aided in finding and addressing the same roadblocks to successfully reproduce the same figures as the participating students.

CONCLUSIONS: This hackathon allowed several students an opportunity to interact with and evaluate real research outputs, testing the reproducibility of computational data analyses. Partnering with faculty opened opportunities to improve open research materials. This case study outlines one approach for other academic libraries to highlight challenges that face reproducibility in an interactive setting.

PMID:41064702 | PMC:PMC12501581 | DOI:10.12688/f1000research.156917.2

  •  

Personal memory and distant reading can complement each other: a reply to Gillon

J Med Ethics. 2025 Sep 4:jme-2025-111310. doi: 10.1136/jme-2025-111310. Online ahead of print.

ABSTRACT

We respond to Gillon's critique of our data-driven analysis of the history of Journal of Medical Ethics (JME), in which we used a topic model to trace intellectual trends in the journal's first 50 years. Gillon, drawing on his personal memories as JME's second (and longest serving) editor, challenges several of our findings, particularly those concerning the prominence and classification of topics such as Ethics education. In this reply, we clarify misunderstandings that led to part of his criticisms of our method. At the same time, we also briefly discuss some nuances of topic modelling, in particular, its reliance on simplified representations of text, sensitivity to modeling choices and topic interpretations. Rather than viewing computational models and editorial memory as competing sources of insight, we propose that they are complementary: each illuminates different dimensions of the journal's evolution. Gillon's engagement with our work ultimately highlights the importance of methodological transparency and the value of combining digital humanities tools with lived experience in the historiography of academic disciplines.

PMID:40908135 | DOI:10.1136/jme-2025-111310

  •  

Link to link: The 'Lesbian Eroticverse' of personal narratives and Indian Digital Platforms

J Lesbian Stud. 2025 Sep 4:1-17. doi: 10.1080/10894160.2025.2555666. Online ahead of print.

ABSTRACT

Following the 2000s, India witnessed an emergence of cyber-queer spaces in the form of several online multimedia platforms, social media networking and dating sites which explored the intersections of gender and sexuality. Through an analysis of the queer 'I' in digital personal narratives by queer AFAB individuals on multimedia platforms like Agents of Ishq (AOI) and Point of View (POV), we argue that both the design of these websites and this written personal history contain a self-affirming erotic power. This power is enmeshed with and informed by an Indian lesbian political history which demands that we rethink our relationship with the challenges of lesbian desire, intersectionality, caste subalternity and the politics of inclusion both online and offline.

PMID:40905382 | DOI:10.1080/10894160.2025.2555666

  •  

Assessing advanced handwritten text recognition engines for digitizing historical documents

Int J Digit Humanit. 2025;7(1):115-134. doi: 10.1007/s42803-025-00100-0. Epub 2025 May 12.

ABSTRACT

This study provides critical insights and evaluates the performance of state-of-the-art Handwritten Text Recognition (HTR) engines (PyLaia, HTR+, IDA, TrOCR-f, and Transkribus' proprietary Transformer-based "supermodel" Titan) in digitizing historical documents. Using a diverse range of datasets that include different scripts, this research assesses each engine's accuracy and efficiency in handling multilingual content, complex styles, abbreviations, and historical orthography. Results indicate that, while all engines can be trained or fine-tuned to improve performance, Titan and TrOCR-f exhibit superior out-of-the-box capabilities for Latin-script documents. PyLaia, IDA, and HTR+ excel in specific non-Latin scripts when specifically trained or fine-tuned. This study underscores the importance of training, fine-tuning, and integrating language models, providing critical insights for future advancements in HTR technology and its application in the digital humanities.

PMID:40584138 | PMC:PMC12202554 | DOI:10.1007/s42803-025-00100-0

  •  

Forging an interdisciplinary lens for understanding community digital archives of South Asian diaspora

Front Sociol. 2025 Jun 5;10:1450641. doi: 10.3389/fsoc.2025.1450641. eCollection 2025.

ABSTRACT

Different communities have begun archiving their own experiences and histories as a way to reclaim narratives and contend with their own identities and belonging. As the types of archives diversify and the role of digital technologies in archival practices expands, we are increasingly seeing digital community archival efforts. While archives have been key for carrying out research in the social sciences and the humanities, and are periodically found as topics of study in disciplinary subfields concerning themselves with the digital, there is little research on the specific subject of community digital archives. In this essay, I argue that community digital archives are important objects of sociological and historical inquiry. I discuss two community digital archives of the South Asian diaspora - the South Asian American Digital Archive (SAADA) and 1947 Partition Archive. I show that they offer insights into migration histories and notions of belonging and identity of South Asian diaspora not only through the digital records they produce, but also through how they operate using digital connectivity. I demonstrate that an interdisciplinary lens is key for critically engaging with these archives.

PMID:40538462 | PMC:PMC12177577 | DOI:10.3389/fsoc.2025.1450641

  •  

Biodigital Philosophy, Technological Convergence, and Postdigital Knowledge Ecologies

Postdigit Sci Educ. 2021;3(2):370-388. doi: 10.1007/s42438-020-00211-7. Epub 2021 Jan 11.

ABSTRACT

New technological ability is leading postdigital science, where biology as digital information, and digital information as biology, are now dialectically interconnected. In this article we firstly explore a philosophy of biodigitalism as a new paradigm closely linked to bioinformationalism. Both involve the mutual interaction and integration of information and biology, which leads us into discussion of biodigital convergence. As a unified ecosystem, this allows us to resolve problems that isolated disciplinary capabilities cannot, creating new knowledge ecologies within a constellation of technoscience. To illustrate our arrival at this historical flash point via several major epistemological shifts in the post-war period, we venture a tentative typology. The convergence between biology and information reconfigures all levels of theory and practice, and even critical reason itself now requires a biodigital interpretation oriented towards ecosystems and coordinated Earth systems. In this understanding, neither the digital humanities, the biohumanities, nor the posthumanities sit outside of biodigitalism. Instead, posthumanism is but one form of biodigitalism that mediates the biohumanities and the digital humanities, no longer preoccupied with the tradition of the subject, but with the constellation of forces shaping the future of human ontologies. This heralds a new biopolitics which brings the philosophy of race, class, gender, and intelligence, into a compelling dialog with genomics and information.

PMID:40477145 | PMC:PMC7797699 | DOI:10.1007/s42438-020-00211-7

  •  

Diversity statistics of onomastic data reveal social patterns in Hebrew Kingdoms of the Iron Age

Proc Natl Acad Sci U S A. 2025 May 20;122(20):e2503850122. doi: 10.1073/pnas.2503850122. Epub 2025 May 14.

ABSTRACT

The distribution of personal names provides unique, yet often overlooked, insight into modern and historical societies. This study employs diversity statistics, commonly used in ecology, to analyze onomastic data from Iron Age II archaeological excavations in the Southern Levant (950-586 BCE). Our findings reveal higher onomastic diversity in the Kingdom of Israel compared to Judah, suggesting a more cosmopolitan society. We also observe a decrease in name diversity in Judah over time, potentially reflecting sociopolitical changes. Center/periphery analysis shows contrasting patterns in Israel and Judah. These results provide insights into social dynamics, cultural interactions, and identity formation in these ancient societies. Our methodology, validated using supplementary archaeological data, as well as modern datasets, offers a robust framework for applying diversity statistics across various modern and historical contexts.
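To make the idea of diversity statistics on name data concrete, here is a minimal sketch. The abstract does not say which indices were used, so the Shannon index and the Gini-Simpson index are assumptions, chosen because they are standard in ecology. The name lists are invented placeholders, not data from the study.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_index(names: Iterable[str]) -> float:
    """Shannon entropy (natural log) of the name frequency distribution."""
    counts = Counter(names)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one name")
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def gini_simpson_index(names: Iterable[str]) -> float:
    """Probability that two names drawn with replacement are different."""
    counts = Counter(names)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one name")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


if __name__ == "__main__":
    # Placeholder samples: one dominated by a single name, one evenly spread.
    concentrated = ["name_a"] * 8 + ["name_b"] * 2
    even = ["name_a", "name_b", "name_c", "name_d", "name_e"] * 2
    for label, sample in (("concentrated", concentrated), ("even", even)):
        print(label, shannon_index(sample), gini_simpson_index(sample))
```

Both indices are higher for the evenly spread sample, which is the sense in which a name corpus with more evenly used names counts as more diverse. Comparisons between corpora of different sizes would need additional care (for example, rarefaction), which this sketch does not attempt.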

PMID:40366687 | PMC:PMC12107089 | DOI:10.1073/pnas.2503850122

  •  

CORHOH: Text corpus of holocaust oral histories

Data Brief. 2025 Feb 24;59:111426. doi: 10.1016/j.dib.2025.111426. eCollection 2025 Apr.

ABSTRACT

This paper outlines the compilation and annotation process of CORHOH: Text CORpus of Holocaust Oral Histories. The corpus consists of 500 oral histories, each narrative from one survivor. The transcripts of the oral histories are retrieved from the Let Them Speak Project [1]. The transcripts are normalized and further annotated. The corpus offers rich metadata about both the testimony givers and the interviews. All technical content is removed, and a unique identifier is assigned to each question (posed by the interviewer) and answer (provided by the survivor). The corpus complies with the TEI guidelines [2]. The corpus includes 106,519 questions and 107,125 answers, making it easy to distinguish between the utterances that belong to the Holocaust survivor or anyone else who is involved in the interview, primarily the interviewer. CORHOH is particularly suited for studies on trauma expression and psychological concepts embedded in survivors' narratives. Additionally, it offers potential for data mining to uncover patterns (e.g., migration trends) and supports natural language processing techniques, such as topic modelling, sentiment analysis, and named entity recognition. The CORHOH data is courtesy of the United States Holocaust Memorial Museum (USHMM) and is publicly available under the CC BY-NC-SA 4.0 license.

PMID:40124297 | PMC:PMC11927712 | DOI:10.1016/j.dib.2025.111426

  •  

From Shared Horizons to Impactful Collaboration and Engagement: The Interagency Partnership of the National Endowment for the Humanities/National Library of Medicine, 2012-2024

Interag J. 2024;14(2):17-31.

ABSTRACT

In 2012, leaders in the National Endowment for the Humanities (NEH) and the National Library of Medicine (NLM) established an interagency partnership to collaborate on research, education, and career initiatives located at the intersection of biomedical and humanities research. Shortly thereafter, the agencies joined with the Maryland Institute for Technology in the Humanities and Research Councils UK (now known as UK Research and Innovation) to convene the symposium Shared Horizons: Data, Biomedicine, and the Digital Humanities. Researchers Erez Aiden and Jean-Baptiste Michel praised the symposium for "betray[ing] an astonishing optimism: the idea that historians and philosophers and artists and doctors and biologists, thinking about data together, can advance their individual causes better than any of them can alone." Aiden and Michel continued, "The conference title…was dead-on. At the interface of all our work lies the most exciting terrain in our intellectual future" (206-7). Ten years on, the NEH-NLM interagency partnership has catalyzed and facilitated joint leadership yielding multiple collaborations, engagements, public programs, and open access publications involving dozens of individuals and touching thousands more. At every turn these initiatives have advanced the complementary missions of the NEH and the NLM, including their commitment to open access publishing, as defined by UNESCO to be "the provision of free access to peer reviewed, scholarly and research information to all,… requir[ing] that the rights holder grants worldwide irrevocable right of access to copy, use, distribute, transmit, and make derivative works in any format for any lawful activities with proper attribution to the original author." This article takes stock of the NEH-NLM interagency partnership, conveying its impact on and relevance to the public service of both agencies.
As discussed, the NEH-NLM partnership advances a "whole of society approach" toward improving individual and public health writ large: not only in terms of connecting lab, clinic, and community, but also more broadly in terms of supporting the dissemination of trusted health information and sharing knowledge about the human condition across time and place and as studied by a variety of disciplines ranging from the sciences to the social sciences to the humanities. The partnership also advances a "whole of government" approach toward making government more efficient, transparent, accessible, and impactful through outcomes that are not possible when working in isolation. Examining a decade-plus history demonstrating leadership, management, and mutual support among public sector colleagues, this article points to fundamental lessons learned to help achieve "whole of government" activities in other contexts for the greater good.

PMID:40109503 | PMC:PMC11921643

  •  

Spatiotemporal distribution characteristics of Nanjing place names-Based on data mining of Tang-Song poetry and online travelogues

PLoS One. 2025 Feb 24;20(2):e0319244. doi: 10.1371/journal.pone.0319244. eCollection 2025.

ABSTRACT

Tang-Song poetry, a distinguished element of China's traditional cultural heritage, is intricately linked with the historical and cultural development of Chinese cities. This paper uses Nanjing as a case study and applies digital humanities techniques to analyze and compare the spatiotemporal distribution of place names found in Tang-Song poetry with those in online travel narratives. The aim is to uncover key factors that have influenced the cultural continuity of these historical cities and their relevance today. Findings indicate that: (1) Locations mentioned in Tang and Song poetry show significant spatial differentiation, with urban areas displaying a clustered distribution and suburbs showing scattered hotspots. (2) The number of locations referenced in Song poetry increased significantly compared to Tang poetry, suggesting that Nanjing's economic growth heightened the city's appeal and inspired more literary output. (3) Song Dynasty poetry reflects a shift toward more neutral and negative emotions, with a marked decrease in positive expressions. This rise in negative sentiment can be traced to the decline in national strength from the Tang to the Song Dynasty, amplifying Nanjing's role as a place of reflection and mourning. (4) Nanjing's cultural hotspots, such as Xuanwu Lake and the Zhongshan Scenic Area, feature prominently in both Tang-Song poetry and modern travelogues. This study contributes to research in literary geography and literary tourism at the urban spatial level, offering fresh insights into the cultural legacy of historical cities.

PMID:39992992 | PMC:PMC12005594 | DOI:10.1371/journal.pone.0319244

  •  

MBI-KG: A knowledge graph of structured and linked economic research data extracted from the 1937 book "Die Maschinen-Industrie im Deutschen Reich"

Data Brief. 2024 Dec 17;58:111238. doi: 10.1016/j.dib.2024.111238. eCollection 2025 Feb.

ABSTRACT

The MaschinenBauIndustrie Knowledge Graph (MBI-KG) is a structured and semantically enriched dataset extracted from the 1937 publication "Die Maschinen-Industrie im Deutschen Reich" (The Machinery Industry in the German Reich), published by the "Wirtschaftsgruppe Maschinenbau" and edited by Herbert Patschan. This historical source offers data on German companies within the mechanical engineering industry during the pre-World War II era. The book was digitized, and Optical Character Recognition (OCR) was applied to extract text. The unstructured extracted data was then structured and semantically enriched to enable data integration and reuse. The semantically enriched data was uploaded into an open-source knowledge-graph software. The resulting knowledge graph includes detailed information about companies, individuals, and administrative entities relevant to the German mechanical engineering industry. The data is accessible through various means, including a SPARQL endpoint, an API, advanced search functionalities, a reconciliation API, and bulk files. Each entity in the knowledge graph can be exported in multiple formats, such as CSV, RDF (ttl), JSON, and NDJSON, ensuring compatibility with diverse research tools and platforms. This dataset can be reused in various research domains, including economic history, data science, and digital humanities. By providing machine-readable, structured data from a crucial historical period, the MBI-KG facilitates novel analyses and insights into the economic and industrial landscape of early 20th-century Germany. The dataset's interoperability with other data sources and its alignment with FAIR principles further enhance its value for interdisciplinary research and long-term preservation.

PMID:39830614 | PMC:PMC11742587 | DOI:10.1016/j.dib.2024.111238

  •