Today our guest is Dr. Jeff Turner. Jeff, I’m going to share what I’ve prepared about you and then you’re welcome to fill in the gaps. So Jeff received his PhD in US history from the University of Utah. His expertise lies in digital humanities, American religious history, and migration. And his research traces the ways migration and immigration inspectors and policymakers construct religion at the US border in the late 19th and early 20th centuries.
Professor Turner’s digital work spans a variety of digital humanities methods. He entered DH along with a grassroots graduate student group at the University of Utah, who taught themselves to topic model and published an article together.
His subsequent experience came from wanting to understand the relationship between critical theory and project building, and also a desire to pay the rent. So you’re very practical. He’s worked on public humanities projects such as the Century of Black Mormons, Native Places Atlas, and the Wilford Woodruff Papers Project. And he works in Python, JavaScript, HTML, CSS, and a little bit of R and SQL.
Season: 5
Episode: 1
Date: 3/2025
Presenter: Jeffrey Turner
Topic: Religion at the American Border, early twentieth century
Tags: OCR; Machine Learning; History; Digital Humanities
As part of our blog series, “Stories from the Research Trenches,” we often invite researchers and colleagues to share their personal experiences. For this installment of the series, we are delighted to have our colleague André Davids from KU Leuven Library of Economics and Business share about his recent Erasmus+ stay at the University of Mannheim. André talks specifically about the opportunity to explore Optical Character Recognition (OCR) tools, a topic that Faculty of Arts researchers often seek advice about. Read about André’s experience learning about various OCR software options, his takeaways on how they do things at the University of Mannheim, and his impressions about the city itself.
Meanwhile, somewhere else: Erasmus+ in Mannheim
Hello, I am André, and in March 2023, as part of the Erasmus+ program, I spent five days at the University of Mannheim. Why did I choose Mannheim? In the context of my work at the Library of Economics and Business, where I am involved, among other things, with OCR (Optical Character Recognition), it quickly became clear after some online research that they are very actively engaged in that field there.
I was warmly received at the library by Stefan Weil, one of the most active current developers of the OCR software Tesseract. He told me a lot about the university and the city, but also introduced me to the world of Linux, Ubuntu, Debian. In addition, I was able to experiment with various OCR software (Tesseract, eScriptorium, Pero-OCR) and received more information about the OCR-D Project.
In Mannheim, they primarily work on the further development of open-source software. Additionally, they offer support to students and researchers in using this software. Once a month, they organize an open online OCR consultation hour in collaboration with the University of Heidelberg, where anyone can ask their OCR-related questions. The “clients” are mainly researchers, but also library staff from other universities.
Also interesting to mention: The library has a room, the ExpLAB, which is dedicated to brainstorming, Design Thinking, etc. This room is fully equipped for brainstorming sessions, but also has Eye-Tracking Stations, Virtual Reality glasses, etc., which can be used by both students and staff.
This Erasmus+ experience not only enriched my knowledge about OCR but also about the city and university. Although Mannheim is a well-known city, I didn’t know much about it myself. Due to its architecture, it was chosen by the Allies in 1940 as a place to experiment with air raids and complete city destruction. As a result, there wasn’t much left of the city after World War II, and it had to be rebuilt. After long debates, the Baroque Palace (Barockschloss) was also rebuilt. Luckily so, because in 1967, the University of Mannheim could establish itself there. This building, with its width of 450 meters, is the second-largest baroque palace in Europe, after Versailles (but – and this is important – it has one more window than Versailles).
Navigating the city was quite a challenge since the city center has no street names but has been divided into squares since the 17th century. The most striking street is the one in front of the university, the “Kurpfälzer Meile der Innovationen” (Palatinate Mile of Innovation), which has 42 bronze plaques on the ground honoring famous innovators such as Carl Benz (automobile), Karl von Drais (precursor to the bicycle), Werner von Siemens, and others. Maybe an idea for KU Leuven?
What stuck with me most in terms of their work culture is the Teams channel called “Mittagessen” (lunch). This is where colleagues arrange lunch plans. This is also how I met a colleague who, as a student, did Erasmus at KU Leuven. I still don’t fully understand their working hours. Apparently, they work 40 hours a week, but I was always the first one there and one of the last to leave… Maybe they calculate time differently there. Everywhere is different, but a lot is still familiar. I look back very positively on my trip to another library and can highly recommend it to everyone.
Also interesting to see is the university library’s introductory video:
As part of Love Data Week 2025, the OCR Competence Center of the university libraries of Tübingen and Mannheim invites you to a special edition of our OCR consultation hour! On Thursday, 13 February 2025, we will be available a little longer than usual, from 15:00 to 16:30, via Zoom.
In addition to answering your questions about the automatic text recognition of handwriting and print, the lawyer Vasilka Stoilova (UB Mannheim & BERD@NFDI) will offer insights and advice on legal aspects, from digitization and full-text recognition through to the provision and reuse of the full texts. Take the opportunity to put your legal questions about OCR to an expert.
Two months into this fellowship, I have prayed in the following places:
The Grad lounge
Brandon’s office
Shane’s office
Amanda’s office
The first time, it felt strange. I had barely known everyone for a week. I didn’t want to make anyone uncomfortable. I didn’t want to seem like I was putting on a show of religiosity. I didn’t want to be stereotyped and put into a box.
Each time I asked if I could pray in the Scholars’ Lab space, those around me were extremely accommodating, offering to leave the room to give me privacy. That made it feel like even more of an imposition. I felt too conspicuous, too seen. The kinder everyone was, the more uncomfortable I felt. I couldn’t make sense of it. Why did this kindness make me feel like an outsider?
Soon enough, the afternoon prayer started eliciting other uncomfortable thoughts. Once, as I unfurled my prayer mat, I wondered if the DH tools we discovered would ever support Punjabi or Urdu (my research languages). Shane and I had spent an entire morning trying Tesseract’s OCR software on images with Persian, Urdu, and Punjabi text, but the invariable result was gibberish. A few weeks later, when I wanted my name in both English and Urdu on our Charter website, Jeremy said he’d figure out if and how that was possible. I nearly told him to forget I mentioned it. I remember noticing how brown my skin was as I prayed that day.
The experience of double consciousness each time I pray in the Scholars’ Lab is a stark reminder that I don’t fully belong in the ‘Digital’ Humanities. I have to be accommodated for, adjusted to, and worked around. It doesn’t matter how sincerely the Scholars’ Lab staff welcome me into their physical space. As soon as we face a laptop screen, I am stripped down to an anglicized, areligious, apolitical version of myself. For the computer only recognizes these fragments. Here, too, it has become the job of the SLab folks to stretch themselves in unexpected ways to make me whole again: by trying to find digital platforms and tools with Right-To-Left (RTL) language support; by hunting down essays on Global DH and Minimal Computing; by dredging up their own insecurities and limitations in conversations to assure me of my place in DH.
The message is clear: It takes the kindness and effort of individual DH scholars to make space for me within systems that were not designed for people like me. Grateful as I am, it is not kindness I want, but the chance to be an equal collaborator. To create and share knowledge across the linguistic communities I belong to.
In a recent paper, Masoud Ghorbaninejad, Nathan P. Gibson and David Joseph Wrisley have discussed the Anglocentric nature of current DH infrastructures that largely ignore the “digital habitus”1 of RTL language users. They state that “knowledge is not just cultural content embedded in language; it is also infrastructure that allows that content to be represented, circulated, and preserved for the concerned communities.” Of the many tools I have discovered these past few months – Omeka, Voyant tools, MALLET, Tesseract, to name a few – not a single one supports Urdu or Punjabi in any meaningful way. As a multilingual South Asian and a student of Muslim literatures, each interaction with these tools involves two things: (1) silencing the very voices within me that have already undergone violence at the hands of the English language, and (2) a fervent hope for alternatives.
As part of the DH Colloquium at the BBAW, we cordially invite you to the next session on Monday, 30 September 2024, at 4 p.m. c.t. (virtual room: https://meet.gwdg.de/b/lou-eyn-nm6-t6b):
Christian Reul (Universität Würzburg) on Automatic Text Recognition for the (Digital) Humanities: OCR4all as an Open-Source Approach
***
A central part of the work of researchers in the humanities and cultural studies is engaging with historical sources in the form of printed and handwritten textual witnesses. These are often available only as scans, from which machine-processable full text must first be extracted using methods of automatic text recognition. Very old prints and manuscripts in particular still pose a major challenge, for a variety of reasons. The freely available open-source tool OCR4all, developed at the Zentrum für Philologie und Digitalität (ZPD) at the Universität Würzburg, aims to enable even less technically experienced users to transcribe demanding prints and manuscripts independently and at the highest quality. OCR4all combines the entire text recognition workflow and all the tools it requires in a single application that is operated through a convenient graphical user interface.
The talk first explains the fundamentals of automatic text recognition and introduces OCR4all and how it works. It then demonstrates the tool's applicability and performance on different kinds of material and gives an overview of current work as well as an outlook on future developments.
***
The event takes place online; registration is not required. At the scheduled time, the virtual conference room can be reached via the link https://meet.gwdg.de/b/lou-eyn-nm6-t6b. We kindly ask you to turn off your microphone and camera when entering the room. Once the discussion has begun, you can signal that you wish to speak by turning on your camera.
The event focuses both on practice-oriented topics and concrete application examples and on critical reflection on digital humanities research. Further information is available on the BBAW website.
The OCR Competence Center, formed by the university libraries of Tübingen (UB Tübingen) and Mannheim (UB Mannheim), has for three years been supporting and advising users on applying current text recognition software.
For an easy introduction to the topic, we offer an open OCR consultation hour via Zoom for anyone interested, every second Thursday of the month from 15:00 to 16:00, where you can ask your questions about automated text recognition.
The next consultation hour takes place on Thursday, 14 March 2024.
The next consultation hour takes place on Thursday, 11 January 2024.
The next consultation hour takes place on Thursday, 12 October 2023.
The next consultation hour takes place on Thursday, 14 September 2023.
The Faculty of Arts and KU Leuven Libraries are jointly organizing a 2-day training (22/11 and 23/11) focusing on “self-made” document digitization for researchers and staff members. Different modules (which can be followed independently of each other) will teach you: (1) how to photograph archival materials to ensure the best possible quality; (2) how to create an ideal research data management (RDM) workflow for the digitized materials; and (3) how to apply Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) to automatically transcribe text from images.
Practical Details
Module 1 (day one) Digitization and Research Data Management
Module 2 (day two) OCR and HTR
Date: Tuesday 22 November
Time: 10:00-16:00 (1 hour reserved for lunch on your own)
Are you a Dutch speaker who needs to transcribe old or new hand-written materials, or do optical character recognition (OCR) on print materials? Check out this upcoming webinar on Transkribus in Dutch, taking place on May 31, 2022, at 16h CEST:
This webinar by Dr. Annemieke Romein gives an overview of the basics of Transkribus in Dutch. You will learn how to upload documents to Transkribus, run the layout analysis, make manual transcriptions to generate training data, use the automated recognition, which public models we offer, how training your own model works, and how to search your documents for specific words and phrases. We will go through the workflow step by step, and you will have the chance to ask questions via the chat.
You do not need to register to take part in this webinar (it will last about 45 minutes, plus time for questions); you can access it via this link: https://youtu.be/xe-OTS48FK
The first meeting of the spring 2022 edition of the DH Virtual Discussion Group for ECRs in Belgium kicked off on Monday 21 March with a presentation from Gianluca Valenti (University of Liège). We had a total of twenty attendees—some new faces and some familiar ones—who all contributed to an engaging conversation about digital humanities.
This session followed our standard format, which opens with a greeting from the organizers, Julie M. Birkholz (KBR and Ghent University), Margherita Fantoli (KU Leuven), and Leah Budke (KU Leuven). This is followed by our networking session where new and returning attendees can introduce themselves in a small group, tell about their interests and experiences in DH, and get to know others in the community. This networking moment also allows those of us who already know each other to catch up and enjoy a coffee or tea before the main presentation starts and to welcome new members into our community. After the networking moment, the group comes back together to share any upcoming DH events or opportunities. The main event follows, when a member of our community gives us a behind-the-scenes look at a digital project, workflow, or tool.
Gianluca’s “under-the-hood” presentation was titled “Modern Letters and Text Analysis: The ‘EpistolarITA’ Project” and discussed the importance of epistolary texts in historical research. As Gianluca explained, today there is a wealth of correspondence available to researchers, but we are still lacking adequate tools to engage with these materials to the fullest extent. The EpistolarITA project aims to fill this gap and to contribute to scholarly efforts to exploit historical epistolary texts through the development of the EpistolarITA database. The database brings together fifteenth- through seventeenth-century Italian letters and allows users to perform statistical analysis on this corpus. As Gianluca explained in his presentation, the database allows readers to compare a target text to the texts in the corpus. The database can then return similar texts, ranked in order of their similarity. To accomplish this, the algorithm uses a number of different techniques, including TF-IDF, Word2Vec, and named-entity recognition. The advantage of using the database, as Gianluca demonstrated, is that it allows users to draw connections or to see patterns that they might not otherwise see. While the full text of the letters is not made available due to copyright restrictions, users can still perform text analysis on these materials and obtain results that would otherwise require many visits to the archives, plus the additional work of building the infrastructure that makes this type of analysis possible.
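As a rough illustration of the ranking idea described above, the sketch below compares a target text against a small corpus using TF-IDF weights and cosine similarity, then sorts the corpus by score. It is not the EpistolarITA implementation: it covers only the TF-IDF part (not Word2Vec or named-entity recognition), the function names are hypothetical, the smoothed IDF formula is one common variant chosen here as an assumption, and the sample sentences are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        vector = {}
        for term, count in counts.items():
            tf = count / total
            # Smoothed IDF (an assumption; other variants exist).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(target, corpus):
    """Return (corpus_index, score) pairs, most similar text first."""
    vectors = tfidf_vectors([target] + list(corpus))
    target_vec, corpus_vecs = vectors[0], vectors[1:]
    scored = [(i, cosine(target_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Invented sample texts, standing in for letters in a corpus.
    corpus = [
        "the duke sends greetings to the ambassador in venice",
        "grain prices rose sharply in the market this winter",
        "the ambassador writes to the duke from venice",
    ]
    target = "a letter from the ambassador in venice to the duke"
    for index, score in rank_by_similarity(target, corpus):
        print(f"{score:.3f}  {corpus[index]}")
```

With these sample texts, the sentence about grain prices ranks last, because it shares only very common words with the target. A production system would typically add language-specific preprocessing and combine several signals, as the presentation described.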
The EpistolarITA database is still in the process of being populated, but the official publication is expected this spring. For now, the project site and database are entirely in Italian, but the team hopes to make an English translation available in the future.
If a look behind the scenes of a digital project sounds interesting to you, we would be delighted to have you join us for our next DH Virtual Discussion Group meeting on Monday 25 April from 15h-16h30 CEST! In this session, Montaine Denys from the Flanders Heritage Libraries will take us behind the scenes of the Flanders Heritage Libraries’ digitization projects. Montaine’s talk, titled “Managing the Evaluation of OCR Quality in Flemish Newspaper Collections,” will include a discussion of the project workflow, the creation of a “ground truth” dataset, interpreting results, and the specific challenges they have faced and the lessons they have learned while undertaking this project.
To join us for this session or any future sessions all you need to do is register for our mailing list. Once registered, you will receive all future emails, including the links to the Zoom meetings. These links are distributed via email the morning of the event.
The DH Virtual Discussion Group is designed to be a low-threshold way for researchers, particularly early career researchers, to come together and learn about digital humanities. Everyone is welcome to attend and absolutely no DH expertise is required. To see a full overview of this spring’s sessions, click here. If there is a session that seems of interest to you, please do join us!