ADHC Talks Podcast: A Conversation with Vincent Scalfani (5.2)

By adhcadmin
November 18, 2025, 03:19

Description

Our guest today is Dr. Vincent F. Scalfani. Vincent serves as the director of research computing services at the University of Alabama Libraries, providing leadership and support for the newly evolving research computing services across the disciplines. Additionally, he serves as subject liaison for chemical sciences and mathematics. Before joining the University of Alabama in 2012, he earned a PhD in chemistry from Colorado State University.

His research interests include chemical information and cheminformatics. Today, we’re going to be talking about a project that has been ongoing, a project that I am just absolutely in love with. It’s the University of Alabama Libraries Scholarly API Cookbook, which is an open-access online book featuring concise code examples or recipes that illustrate how to interact with various scholarly web service APIs.

These APIs enable researchers to automate search queries, customize data sets, and more easily integrate their information workflows into downstream data analysis processes. Launched in 2022, the cookbook is continually enhanced and updated by student programmers at the libraries. Vin and I have been on faculty here at the university libraries for about 13 years.

Season: 5

Episode: 2

Date: 3/2025

Presenter: Vincent Scalfani

Topic: Scholarly API Cookbook

Tags: Coding; Scholarly API research; research computing

The post ADHC Talks Podcast: A Conversation with Vincent Scalfani (5.2) appeared first on Alabama Digital Humanities Center.

The M.E. Test

April 15, 2025, 12:00

I recently gave a workshop for the US Latino Digital Humanities Center (USLDH) at the University of Houston on introductory text analysis concepts and Voyant. I don’t have a full talk to share since it was a workshop, but I still thought I would share some of the things that worked especially well about the session. USLDH recorded the talk and made it available here, and you can find the link to my materials here.

I had a teaching observation when I was a graduate student, and one comment always stuck with me. My director told me, “this was all great but don’t be afraid to tell them what you think.” I’ve written elsewhere about how I tend to approach classroom facilitation as a process of generating questions that the group explores together. This orientation is sometimes in conflict with DH instruction, where you have information that simply needs to be conveyed. I had this tension in mind while planning the USLDH event. It was billed as a workshop, and I think there’s nothing worse than attending a workshop only to find that it’s really a lecture. How to balance the generic expectations with the knowledge that I had stuff I needed to put on the table? As an attempt to thread this needle, I structured the three-part session around a range of different kinds of teaching moves: some lecture, yes, but also a mix of open discussion, case study, quiz questions, and free play with a tool.

The broad idea behind the workshop entitled “Book Number Graph” is that people come to text analysis consultations with all varieties of materials and a range of research questions. Most often, my first step in consulting with them is to ask them to slow down and think more deeply about their base assumptions. Do they actually have their materials in a usable form? Is it possible to ask the questions they are interested in using the evidence they have? I built the workshop discussions as though I was prepping participants to field these kinds of research consultations, as though they were digital humanities librarians.

First, the “book” portion of the workshop featured a short introduction to different kinds of materials, exploring how format matters in the context of digital text analysis. We discussed how a book is distinct from an eBook is distinct from a web material, and how all of these are really distinct from the kind of plain text document that we likely want to get to. I used here a hypothetical person who shows up in my office and says, “Oh yeah, I have my texts. I’m ready to work on them with you. Can you help me?” And they will hand me either a stack of books or a series of PDF files that haven’t been OCR’d. I introduced workshop participants to the kinds of technical and legal challenges that arise in such situations so that they’ll be able to better assess the feasibility of their own plans. This all built to a pair of case studies where I asked the participants how they would respond if a researcher came to them with questions for their own project.

First case study: I am interested in a text analysis project on medieval Spanish novels. Oh yeah I have my texts. Can I meet? What kinds of questions would you ask this person? What kinds of problems might you expect? How would you address them?

Second case study: I want to study the concept of the family as discussed in online forums for Mexican-American communities. Can we meet to discuss? What kinds of questions would you ask this person? What kinds of problems might you expect? How would you address them?

With these case studies, I hoped to give participants a glimpse into the real-world kinds of conversations that I have as a DH library worker. For the most part, consultations begin with my asking a range of questions of the researcher so as to help them get new clarity on the actual feasibility of what they want to do. I hoped for the participants to question the formats of the materials for these hypothetical researchers and point out a range of ethical and legal concerns. Hopefully they would be able to ask these questions of their own work as well.

Has anyone made this available before? If yes…Can I use it? Under what terms? If not…Do I have access to the texts myself? If yes…What format are they in? If not available as plain text…Can I convert them into the format I need? What do I want to do with these texts? Is it allowed?

For the second section of the workshop entitled “number,” I gave participants an introduction to thinking about evidence and analysis, distinguishing between what computers can do and the kinds of things that readers are good at. Broadly speaking, computers are concrete. They know what’s on the page and not what’s outside of it. Researchers in text analysis need to point software to the specific things that they are interested in on the page and supplement this information with any other information outside of the text. Complicated text analysis research questions have at their core really simplistic, concrete, measurable things on the page. You are pointing to a thing and counting. For examples of the things that computers can readily be told to examine, we discussed structural information, proximity, the order of words, frequency of words, case, and more.

To practice this, I adapted an exercise that I was first introduced to by Mackenzie Brooks but that was developed by librarians at the University of Michigan. To introduce TEI, the activity asks students to draw boxes around a printed poem as a way to identify the different structural elements that you would want to encode. For my purposes, I put a Langston Hughes poem on the Zoom screen and asked participants to annotate it with all sorts of information that they thought a computer would be capable of identifying.

Langston Hughes poem ready to be annotated

The result was a beautiful tapestry of underlines and squiggles. Some of the choices would be very easy for a computer: word frequency, line breaks, structural elements. But we also talked about more challenging cases. We know the poem’s title because we expect to see it in a certain place on the page. The computer might be pointed to this by flagging the line that comes after three blank line breaks. But what if this isn’t always the case? It was good practice in how to distinguish between the information we bring to the text and what is actually available on the page. We talked about the challenges in trying to bridge the gap between what computers can do and what humans can do, to try and think through how a complicated intellectual question might take shape in a computationally legible form.

Kinds of things that can be measured: Sequences of characters. Case. Words (tokens). Structural elements, with some caveats. Proximity. Order (syntagmatic axis). Metadata – often has to be added manually
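As a minimal sketch of how a few of these measurable features translate into code, the Python below counts tokens, checks case, and tests proximity on an invented two-sentence sample. The function names, the tokenizing pattern, and the sample text are all assumptions made for illustration; none of them come from the workshop materials.

```python
import re
from collections import Counter


def tokenize(text, lowercase=True):
    """Split text into word tokens.

    Lowercasing is an explicit decision about case: with it, "The" and
    "the" are counted as the same word; without it, they stay distinct.
    """
    if lowercase:
        text = text.lower()
    # Assumption: a "word" is a run of ASCII letters, optionally with
    # one internal apostrophe (e.g. "don't").
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def within_distance(tokens, a, b, window):
    """Proximity: do tokens a and b ever occur within `window` positions?"""
    positions_a = [i for i, tok in enumerate(tokens) if tok == a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == b]
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)


# Invented sample text, used only to exercise the functions.
sample = "The river runs. The river remembers the rain."

tokens = tokenize(sample)
frequencies = Counter(tokens)  # frequency of words
capitalized = [t for t in tokenize(sample, lowercase=False) if t[0].isupper()]  # case

print(frequencies.most_common(2))
print(capitalized)
print(within_distance(tokens, "river", "rain", window=3))
```

Each function only points at something on the page and counts it. Anything beyond that, such as deciding that "river" matters to the research question, still has to come from the researcher.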

Wrapping all this together, I introduced what I called the M.E. test for text analysis research. To have a successful text analysis project you have to have…

MATERIALS  - Appropriate, accessible. EVIDENCE - Identifiable, measurable

  • Materials that are…
    • appropriate to your questions and
    • accessible for your purposes.

You must also have

  • Evidence that is…
    • identifiable to you as an expression of your research question and
    • legible to the tool you are using.

Materials and Evidence. M and E.

M.E.

The next time you sit down to do text analysis, ask yourself, “What makes a good question? M.E. Me!”

XKCD comic on imposter syndrome describing an expert in imposter syndrome who immediately questions her own expertise

Painfully earnest? Sure! But this was a nice little way for me to tie in what I often joke is my most frequently requested consultation topic: imposter syndrome. The M.E. question is a test for deciding whether or not a text analysis research question is appropriate, but it is also a call for you to recognize that you can handle this work. A nice little way for you to give yourself a pump up, because I believe that these methods belong to anyone. Anyone can handle these kinds of consultations. They’re more art than science at the level we are discussing. You just have to know the correct way to approach them. Deep expertise can come later. If you are too intimidated to get started you will never get there.

From there, I closed the “number” portion of the workshop with a couple more case study prompts. I asked participants to respond to two more scenarios as though someone had just walked into their office with an idea they wanted to try out.

Prompt: I am interested in which Shakespeare character is the most important in Romeo and Juliet.

Prompt: I am interested in how space and place are represented in literature of the southeastern United States.

The hypothetical consultation prompts involved, first, an interest in finding the most important characters in a particular Shakespeare play and, second, an interest in space and place in southeastern American literature. In each case, we discussed questions of format and copyright, but we also got to some fairly high-level questions about what kinds of evidence you could use to discuss the research questions. For importance, participants proposed measuring either number of lines for each character or who happens to be onstage for the greatest amount of time. For space and place, we discussed counting place names using Python (a nice way to introduce concepts related to Named Entity Recognition). In each case, my goal was to give the workshop participants a sense of how to test and develop their own research questions by walking them through the process I use when talking with researchers asking for a fresh consultation.
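One of the measures participants proposed, number of lines per character, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it supposes the play is already available as plain text in which each speech begins with the speaker's name in capitals followed by a colon, which real editions may not do, and the sample dialogue is placeholder text rather than Shakespeare's.

```python
from collections import Counter


def lines_per_speaker(script):
    """Count spoken lines per speaker.

    Assumed (hypothetical) format: a speech starts with "NAME: text",
    and following lines without such a prefix belong to the same speaker.
    """
    counts = Counter()
    speaker = None
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        head, sep, rest = line.partition(":")
        if sep and head.isupper():
            # A new speaker label; the rest of the line is their first line.
            speaker = head
            line = rest.strip()
            if not line:
                continue  # label on its own line, speech starts below
        if speaker is not None:
            counts[speaker] += 1
    return counts


# Placeholder dialogue, invented for this sketch.
sample_script = """
ROMEO: First placeholder line.
Second placeholder line.
JULIET: A placeholder reply.
ROMEO: Another placeholder line.
"""

counts = lines_per_speaker(sample_script)
print(counts.most_common())
```

Whether the character with the most lines is the "most important" one is exactly the kind of question the consultation is meant to raise. The count is evidence, and the argument still has to be made.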

USLDH has shared the recording link, so feel free to check out the recording if you want to see the activities in action. The slides can be found here. And never forget the most important thing to ask yourself the next time you’re working on a text analysis problem:

“What makes a good research question? Me.”

“We got the data – what now?”: Text Mining

October 5, 2024, 01:10
A flood of data in the digital age: the humanities, too, increasingly face the challenge of handling enormous volumes of data. But how can this data be put to meaningful use to open up new research perspectives? The new workshop series “We...

What to make of my high school math average (or, 0.25/20 is not so bad)

September 24, 2024, 12:00

When I was 16, I burned my math exams in a bonfire. I remember holding my last ever math exam in front of my friends, on which a 0.25/20 was marked in bright-red ink, and throwing it in the fire. Feeling a rush of excitement, realizing that I will never have to endure math classes ever again. I would never have to be singled-out by my math teacher for being the worst student of the class, probably of the year, potentially of his career, ever again. Now, I look back at my math years with a more acute sense of how coming from an underprivileged background where no one monitors your homework (and checks if you successfully learnt your times table) and how internalizing a gendered form of knowledge from a very early age (you are a girl you will be drawn to humanities) is a recipe – dare I say the components of an algorithm – for mathematical disaster.

When I applied to Praxis, I was fully aware that being awarded the fellowship would be the first step of a healing journey (as dramatic as it might sound), a healing journey in which band-aids have numbers on them, and not just the fathomable computer binary 0 and 1, but also the mean-looking ones, with squared numbers and exponential functions. Praxis would mean confronting coding, which would require confronting, to a certain extent, mathematics. It feels as though Scholars’ Lab people have now become experts in “teaching the math basics you will need to understand for you to engage in coding” to Humanities people with varying degrees of proficiency in arithmetic. From Shane’s goofy-looking dog Rocky on the first slide of the history and genealogy of computing to constant reassurance, we were presented with a progressive complexity which made our first assignment, “write out in plain English an algorithm to sort a deck of cards,” a funny and appealing game.

Now, I have to be honest and confess that I cried on my way out of the Scholars’ Lab, after this first “Introduction to Data” session. Not because someone said something wrong or made me feel bad – of course not. But because in front of this whiteboard on which were written so many numbers, I felt myself going back in time ten years, blankly staring at the whiteboard in my math class, not understanding a single thing. Not because I did not want to (or perhaps unconsciously), but because I was utterly unable to comprehend what was going on. As if I was stuck in a fever dream where whatever was written down felt like a language from outer space and where someone would just keep repeating “how can you not understand this?”

Then, I remembered the “So you want to be a wizard?” zine that Shane handed out and had us read, and its writer Julia Evans’s positive reframing of difficulty. In this programming zine, she presents bugs as learning opportunities. Bob Ross would have added – “happy accidents”. Somehow, crying after this “Introduction to Data” was a personal necessity. I needed to get my math trauma out of the way, and the deep feelings of shame, guilt, and incompetence that have been hindering me for years. I have no illusion as I know I won’t become Ada Lovelace, Elizebeth Smith Friedman or Mavis Batey – I will still be bad at math, because my brain must have rewired itself differently. But now that we are being invited to learn, fail and learn from apparent failure, I know that I will hold my head high and try, fail, learn and try again, differently. Praxis has allowed me to move on and make peace with the teenager in me who still feels the burning shame of being the last at something. Now, I can tell her that a bad math average makes for the best potential for growth. 0.25/20 is not so bad.

Training: Online Workshops Offered by KU Leuven ICTS on Excel, LaTeX and Python

August 12, 2024, 17:23

This fall, KU Leuven ICTS is offering a selection of online workshops focused on various software tools for working with data. If you have been hoping to learn more about Excel for use with quantitative data, LaTeX for more flexibility when it comes to the format of your academic writing, or Python for more advanced data science techniques (workshop requires knowledge of a previous programming language such as R), then you might be interested in one of the following workshops!

Excel – Basics module 1 (online)

  • All info here
  • What? By means of practical examples you will quickly become familiar with the basic techniques of Excel: Input, Editing, Formatting, Simple calculations.
  • For whom? Anyone who is interested, regardless of their status (PhD student, postdoc, scientific collaborator, etc.). No prior knowledge of Excel required, but some experience with other Office programmes (Word, Outlook) comes in handy.
  • Language: English
  • By whom? KU Leuven central ICTS trainers
  • When & where? Online via Teams, 2 half days: 14/11/2024: 9 a.m.-12.30 p.m. & 15/11/2024: 9 a.m.-12.30 p.m. – 19 places left!
  • How much does it cost? It’s free of charge.
  • How can I register? Via KU Loket, see workshop website.
  • PS – For PhD students this counts for the requirement of minimum 12 hours of transferable skills training. More info here.

LaTeX introduction (online)

  • All info here
  • What? This introduction will teach you how to use an editor (TexnicCenter), create, compile and print a basic LaTeX document.
  • For whom? Anyone who is interested, regardless of their status (PhD student, postdoc, scientific collaborator, etc.)
  • Language: English
  • By whom? KU Leuven central ICTS trainers
  • When & where? Online via Teams, 2 half days: 20/11/2024: 9 a.m.-1 p.m. & 21/11/2024: 9 a.m.-1 p.m. – 13 places left!
  • How much does it cost? It’s free of charge.
  • How can I register? Via KU Loket, see workshop website.
  • PS – For PhD students this counts for the requirement of minimum 12 hours of transferable skills training. More info here.

Python as a second language (online)

  • All info here
  • Please note that there are also several other Python courses, all of which require previous experience with Python: Python for data science, Python for HPC, Python for machine learning, Python-on-GPUs, Scientific Python.
  • What? This training session introduces the programming language to participants who have programming experience with other programming languages such as R, MATLAB, C/C++ or Fortran.
  • For whom? Anyone who is interested and who already has experience in another programming language (e.g. R).
  • Language: English
  • By whom? KU Leuven central ICTS trainers
  • When & where? Online via Teams, 2 half days: 23/10/2024: 9 a.m.-12 p.m. & 24/10/2024: 9 a.m.-12 p.m. – 14 places left!
  • How much does it cost? It’s free of charge.
  • How can I register? Via KU Loket, see workshop website.
  • PS – For PhD students this counts for the requirement of minimum 12 hours of transferable skills training. More info here.

Scraping a webpage’s list of linked files using wget

July 10, 2024, 12:00

I want to apply some text analysis tools to explore questions around a set of podcast interviews. There’s a webpage that lists links to transcripts of these interviews, one link per podcast episode text file. Because there are many episodes (over 100?), I don’t want to manually click each link to download the episode’s transcript file.

Instead, I followed a Programming Historian lesson by Ian Milligan about the command-line utility wget. The lesson helped me understand how to customize wget’s options so it downloads each transcript file for me into a convenient folder, without overloading the website’s servers.

Here’s the command I used (after installing a couple things that Milligan’s lesson walks you through):

wget -r -l 3 --random-wait -w 10 --limit-rate=20k [URL of folder containing desired files to download]

That command consists of the tool name (wget), a bunch of options modifying how the tool downloads files, and the URL you want to be downloading from. The options I chose to fit my particular webpage of interest:

-r says to follow links on the URL I provide to other links

-l 3 says to follow each link to 3 pages away from the initial URL I provided

-w 10 adds a 10 second wait between server requests (initially tried -w 2; that allowed downloading of ~30 files then ran into 429 error again)

--random-wait was in response to my initial wget attempts producing an “ERROR 429: Too Many Requests.” message and not downloading files; it varies the wait time by 0.5 to 1.5 times the length provided with the -w 10 option above

--limit-rate=20k sets the maximum download speed to 20 KB/s to be nice to the site’s bandwidth
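To make the random-wait arithmetic concrete, here is a small Python sketch of the range described above: with a base wait of 10 seconds, each pause falls somewhere between 5 and 15 seconds. This illustrates the 0.5 to 1.5 multiplier only; it is not wget's own implementation, and the function name is a hypothetical one chosen for this example.

```python
import random


def random_wait_seconds(base_wait):
    """Return a pause between 0.5x and 1.5x of base_wait (in seconds).

    Illustrative only: mirrors the range described for --random-wait,
    not how wget computes it internally.
    """
    return random.uniform(0.5 * base_wait, 1.5 * base_wait)


# With -w 10, every pause lands between 5 and 15 seconds.
delays = [random_wait_seconds(10) for _ in range(5)]
print([round(d, 1) for d in delays])
```

Because the pauses vary from request to request, the traffic looks less like a fixed-interval script, which is presumably why it helped with the 429 errors.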

The files are now downloading in the background! My next step will be using another command-line utility, pandoc, to convert the transcript files from one file type (MS Word) to another file type friendlier to text analysis.
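As a rough sketch of that next step, the Python below builds one pandoc command per Word file. It assumes the transcripts are .docx files and that plain text is the target format; the folder names are hypothetical, and pandoc must be installed separately for the conversion itself to run.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path, out_dir):
    """Build the pandoc command that converts one .docx file to plain text."""
    docx_path = Path(docx_path)
    out_path = Path(out_dir) / (docx_path.stem + ".txt")
    return ["pandoc", str(docx_path), "-f", "docx", "-t", "plain", "-o", str(out_path)]


def convert_folder(in_dir, out_dir):
    """Convert every .docx file in in_dir; requires pandoc on the PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for docx_path in sorted(Path(in_dir).glob("*.docx")):
        subprocess.run(pandoc_command(docx_path, out_dir), check=True)


# Hypothetical paths, shown only to illustrate the command that would be run.
example = pandoc_command("transcripts/episode01.docx", "plain_text")
print(" ".join(example))
```

Calling convert_folder("transcripts", "plain_text") would then run the conversion for every file in the folder.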

If you’re interested in automating downloads rather than manually clicking-saving a bunch, you might check out my post from 5(!) years ago on automating taking screenshots of webpages using a list of URLs (which I used to get a folder of screenshots of all my faved tweets).

If you’re interested but have never used the command line, Programming Historian has a peer-reviewed, free online tutorial aimed at humanities/cultural heritage folks who want to learn command line use. I highly recommend PH’s website; not only is every lesson created with communal care (author[s], editor, multiple reviewers), the lessons are aimed at humanities-ish folks (the things that might interest you, the things you might be excited to learn how to do with code), and the lessons are written in a very novice-friendly style (no assumptions you already know things; or the advanced lessons point you to the earlier lessons you’ll need to complete before you can comfortably follow them).

Training: Digital Skills Space

November 3, 2022, 21:51

Do you find yourself wishing you could learn a new digital skill like R or Python, but just cannot seem to carve out the time in your schedule? Do you get started but then find yourself stuck and in need of advice? If so, the Digital Skills Space might be a good fit for you!

The Digital Skills Space is a regular space to learn, practice and share tips to work with digital data and enhance reproducible research. The idea is to carve out a space in busy schedules for investing in digital skills, from managing and processing data, to exploring it with tables and plots, to reporting it in a reliable and reproducible way. 

 If you would like to see what has been covered in previous sessions or read more about the initiative, you can visit the DigiSkills website.

The Digital Skills Space is organized within the Linguistics Department of the Faculty of Arts, KU Leuven, by Mariana Montes.

Sessions take place every Friday from 11:00 to 13:00.
