
United States - University of Virginia (Scholars' Lab UVA)

The M.E. Test

April 15, 2025, 12:00

I recently gave a workshop for the US Latino Digital Humanities Center (USLDH) at the University of Houston on introductory text analysis concepts and Voyant. I don’t have a full talk to share since it was a workshop, but I still thought I would share some of the things that worked especially well about the session. USLDH recorded the talk and made it available here, and you can find the link to my materials here.

I had a teaching observation when I was a graduate student, and one comment always stuck with me. My director told me, “this was all great but don’t be afraid to tell them what you think.” I’ve written elsewhere about how I tend to approach classroom facilitation as a process of generating questions that the group explores together. This orientation is sometimes in conflict with DH instruction, where you have information that simply needs to be conveyed. I had this tension in mind while planning the USLDH event. It was billed as a workshop, and I think there’s nothing worse than attending a workshop only to find that it’s really a lecture. How to balance the generic expectations with the knowledge that I had stuff I needed to put on the table? As an attempt to thread this needle, I structured the three-part session around a range of different kinds of teaching moves: some lecture, yes, but also a mix of open discussion, case study, quiz questions, and free play with a tool.

The broad idea behind the workshop entitled “Book Number Graph” is that people come to text analysis consultations with all varieties of materials and a range of research questions. Most often, my first step in consulting with them is to ask them to slow down and think more deeply about their base assumptions. Do they actually have their materials in a usable form? Is it possible to ask the questions they are interested in using the evidence they have? I built the workshop discussions as though I was prepping participants to field these kinds of research consultations, as though they were digital humanities librarians.

First, the “book” portion of the workshop featured a short introduction to different kinds of materials, exploring how format matters in the context of digital text analysis. We discussed how a book is distinct from an eBook, which is distinct from web material, and how all of these are really distinct from the kind of plain text document that we likely want to get to. Here I used a hypothetical person who shows up in my office and says, “Oh yeah, I have my texts. I’m ready to work on them with you. Can you help me?” And they will hand me either a stack of books or a series of PDF files that haven’t been OCR’d. I introduced workshop participants to the kinds of technical and legal challenges that arise in such situations so that they’ll be able to better assess the feasibility of their own plans. This all built to a pair of case studies where I asked the participants how they would respond if a researcher came to them with questions for their own project.

First case study: I am interested in a text analysis project on medieval Spanish novels. Oh yeah I have my texts. Can I meet? What kinds of questions would you ask this person? What kinds of problems might you expect? How would you address them?

Second case study: I want to study the concept of the family as discussed in online forums for Mexican-American communities. Can we meet to discuss? What kinds of questions would you ask this person? What kinds of problems might you expect? How would you address them?

With these case studies, I hoped to give participants a glimpse into the real-world kinds of conversations that I have as a DH library worker. For the most part, consultations begin with my asking a range of questions of the researcher so as to help them get new clarity on the actual feasibility of what they want to do. I hoped for the participants to question the formats of the materials for these hypothetical researchers and point out a range of ethical and legal concerns. Hopefully they would be able to ask these questions of their own work as well.

Has anyone made this available before? If yes… Can I use it? Under what terms? If not… Do I have access to the texts myself? If yes… What format are they in? If not available as plain text… Can I convert them into the format I need? What do I want to do with these texts? Is it allowed?

For the second section of the workshop entitled “number,” I gave participants an introduction to thinking about evidence and analysis, distinguishing between what computers can do and the kinds of things that readers are good at. Broadly speaking, computers are concrete. They know what’s on the page and not what’s outside of it. Researchers in text analysis need to point software to the specific things that they are interested in on the page and supplement this information with any other information outside of the text. Complicated text analysis research questions have at their core really simplistic, concrete, measurable things on the page. You are pointing to a thing and counting. For examples of the things that computers can readily be told to examine, we discussed structural information, proximity, the order of words, frequency of words, case, and more.

To practice this, I adapted an exercise that I was first introduced to by Mackenzie Brooks but that was developed by librarians at the University of Michigan. To introduce TEI, the activity asks students to draw boxes around a printed poem as a way to identify the different structural elements that you would want to encode. For my purposes, I put a Langston Hughes poem on the Zoom screen and asked participants to annotate it with all sorts of information that they thought a computer would be capable of identifying.

Langston Hughes poem ready to be annotated

The result was a beautiful tapestry of underlines and squiggles. Some of the choices would be very easy for a computer: word frequency, line breaks, structural elements. But we also talked about more challenging cases. We know the poem’s title because we expect to see it in a certain place on the page. The computer might be pointed to this by flagging the line that comes after three blank line breaks. But what if this isn’t always the case? It was good practice in how to distinguish between the information we bring to the text and what is actually available on the page. We talked about the challenges in trying to bridge the gap between what computers can do and what humans can do, to try and think through how a complicated intellectual question might take shape in a computationally legible form.
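The title heuristic described above can be sketched in a few lines of Python. This is only an illustration, assuming a plain-text file in which the title is the first non-blank line that follows at least three blank lines. The function name `find_title` and the sample text are hypothetical, and the sketch returns nothing at all when a file does not follow that layout.

```python
def find_title(text, blank_lines_before=3):
    """Return the first non-blank line preceded by enough blank lines.

    Returns None when no line matches, which is the failure case:
    the rule only works if every file follows the same layout.
    """
    blank_run = 0
    for line in text.splitlines():
        if line.strip() == "":
            blank_run += 1
            continue
        if blank_run >= blank_lines_before:
            return line.strip()
        # A non-blank line that is not the title resets the count.
        blank_run = 0
    return None


# Hypothetical sample: a header, three blank lines, then a title.
sample = "Collected Poems\n\n\n\nA Sample Title\n\nFirst line of the poem\n"
print(find_title(sample))  # A Sample Title
```

The rule is concrete and countable, which is what makes it legible to a computer. It also encodes an assumption about page layout that a human reader never has to spell out.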

Kinds of things that can be measured: Sequences of characters. Case. Words (tokens). Structural elements, with some caveats. Proximity. Order (syntagmatic axis). Metadata – often has to be added manually
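A few of the measurable things listed above (tokens, case, frequency, proximity) can be sketched with the Python standard library. This is a rough sketch, not the method used in the workshop, which relied on Voyant. The function names and the sample sentence are invented for illustration, and the tokenizer is deliberately crude (it splits on anything that is not a letter, digit, or underscore).

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens, keeping the original case."""
    return re.findall(r"\w+", text)


def word_frequencies(tokens):
    """Count tokens, ignoring case."""
    return Counter(token.lower() for token in tokens)


def capitalized_tokens(tokens):
    """Return tokens that start with an uppercase letter (case as evidence)."""
    return [t for t in tokens if t[0].isupper()]


def within_distance(tokens, word_a, word_b, window):
    """Count how often word_b appears within `window` tokens of word_a."""
    lowered = [t.lower() for t in tokens]
    hits = 0
    for i, token in enumerate(lowered):
        if token == word_a:
            start = max(0, i - window)
            neighbors = lowered[start:i] + lowered[i + 1:i + 1 + window]
            hits += neighbors.count(word_b)
    return hits


# Hypothetical sample sentence, invented for illustration.
sample = "The river runs past the old house. The house faces the river."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(capitalized_tokens(tokens))
print(within_distance(tokens, "house", "river", 3))
```

Each function points at something on the page and counts it. Anything off the page, such as who wrote the text or when, would have to be supplied separately as metadata.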

Wrapping all this together, I introduced what I called the M.E. test for text analysis research. To have a successful text analysis project you have to have…

MATERIALS  - Appropriate, accessible. EVIDENCE - Identifiable, measurable

  • Materials that are…
    • appropriate to your questions and
    • accessible for your purposes.

You must also have

  • Evidence that is…
    • identifiable to you as an expression of your research question and
    • legible to the tool you are using.

Materials and Evidence. M and E.

M.E.

The next time you sit down to do text analysis, ask yourself, “What makes a good question? M.E. Me!”

XKCD comic on imposter syndrome describing an expert in imposter syndrome who immediately questions her own expertise

Painfully earnest? Sure! But this was a nice little way for me to tie in what I often joke is my most frequently requested consultation topic: imposter syndrome. The M.E. question is both a test for deciding whether a text analysis research question is appropriate and a call for you to recognize that you can handle this work. It is a nice little way for you to give yourself a pump-up, because I believe that these methods belong to anyone. Anyone can handle these kinds of consultations. They’re more art than science at the level we are discussing. You just have to know the correct way to approach them. Deep expertise can come later. If you are too intimidated to get started, you will never get there.

From there, I closed the “number” portion of the workshop with a couple more case study prompts. I asked participants to respond to two more scenarios as though someone had just walked into their office with an idea they wanted to try out.

Prompt: I am interested in which Shakespeare character is the most important in Romeo and Juliet.

Prompt: I am interested in how space and place are represented in literature of the southeastern United States.

The hypothetical consultation prompts involved, first, an interest in finding the most important characters in a particular Shakespeare play and, second, an interest in space and place in southeastern American literature. In each case, we discussed questions of format and copyright, but we also got to some fairly high-level questions about what kinds of evidence you could use to discuss the research questions. For importance, participants proposed measuring either number of lines for each character or who happens to be onstage for the greatest amount of time. For space and place, we discussed counting place names using Python (a nice way to introduce concepts related to Named Entity Recognition). In each case, my goal was to give the workshop participants a sense of how to test and develop their own research questions by walking them through the process I use when talking with researchers asking for a fresh consultation.
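Both kinds of evidence proposed above can be sketched in Python. The sketch makes two assumptions that a real project would have to check. First, it assumes the play is plain text in which each speech begins with an uppercase speaker label followed by a period. Second, it swaps in a hand-made list of place names (a gazetteer) for Named Entity Recognition, which is simpler but cannot find places missing from the list. The function names, the toy script, and the place list are all hypothetical, and the dialogue is placeholder text rather than Shakespeare’s.

```python
import re
from collections import Counter

# Assumed format: a speech starts with an uppercase speaker label and a period
# and continues until the next speaker label.
SPEAKER_LABEL = re.compile(r"^([A-Z][A-Z ]+)\.\s*(.*)$")


def lines_per_speaker(script):
    """Count the non-blank lines spoken by each character."""
    counts = Counter()
    current = None
    for line in script.splitlines():
        if not line.strip():
            continue
        match = SPEAKER_LABEL.match(line)
        if match:
            current = match.group(1).strip()
            if match.group(2):  # speech text on the same line as the label
                counts[current] += 1
        elif current is not None:
            counts[current] += 1
    return counts


def count_place_names(text, gazetteer):
    """Count whole-word occurrences of each place name in a hand-made list."""
    counts = Counter()
    for place in gazetteer:
        pattern = r"\b" + re.escape(place) + r"\b"
        counts[place] = len(re.findall(pattern, text))
    return counts


# Toy script with placeholder dialogue.
toy_script = """ALPHA. First placeholder line.
Second placeholder line.

BETA. A single placeholder line.
ALPHA. One more placeholder line.
"""
print(lines_per_speaker(toy_script).most_common())

# Toy sentence and a hypothetical three-entry gazetteer.
toy_text = "She left Savannah for Atlanta, then returned to Savannah."
print(count_place_names(toy_text, ["Savannah", "Atlanta", "Mobile"]))
```

Counting lines is only one possible proxy for importance. The point of the exercise is that the researcher has to decide, and defend, which countable thing stands in for the concept.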

USLDH has shared the recording link, so feel free to check out the recording if you want to see the activities in action. The slides can be found here. And never forget the most important thing to ask yourself the next time you’re working on a text analysis problem:

“What makes a good research question? Me.”

What to make of my high school math average (or, 0.25/20 is not so bad)

September 24, 2024, 12:00

When I was 16, I burned my math exams in a bonfire. I remember holding my last ever math exam in front of my friends, on which a 0.25/20 was marked in bright-red ink, and throwing it in the fire. I felt a rush of excitement, realizing that I would never have to endure math classes ever again. I would never again have to be singled out by my math teacher for being the worst student of the class, probably of the year, potentially of his career. Now, I look back at my math years with a more acute sense of how coming from an underprivileged background where no one monitors your homework (and checks if you successfully learnt your times table) and internalizing a gendered form of knowledge from a very early age (you are a girl, so you will be drawn to humanities) are a recipe – dare I say the components of an algorithm – for mathematical disaster.

When I applied to Praxis, I was fully aware that being awarded the fellowship would be the first step of a healing journey (as dramatic as it might sound), a healing journey in which band-aids have numbers on them, and not just the fathomable computer binary 0 and 1, but also the mean-looking ones, with squared numbers and exponential functions. Praxis would mean confronting coding, which would require confronting, to a certain extent, mathematics. It feels as though Scholars’ Lab people have now become experts in “teaching the math basics you will need to understand for you to engage in coding” to Humanities people with varying degrees of proficiency in arithmetic. From Shane’s goofy-looking dog Rocky on the first slide of the history and genealogy of computing to constant reassurance, we were presented with a progressive complexity which made our first assignment, “write out in plain English an algorithm to sort a deck of cards,” a funny and appealing game.

Now, I have to be honest and confess that I cried on my way out of the Scholars’ Lab, after this first “Introduction to Data” session. Not because someone said something wrong or made me feel bad – of course not. But because in front of this whiteboard on which were written so many numbers, I felt myself going back in time ten years, blankly staring at the whiteboard in my math class, not understanding a single thing. Not because I did not want to (or perhaps unconsciously), but because I was utterly unable to comprehend what was going on. As if I was stuck in a fever dream where whatever was written down felt like a language from outer space and where someone would just keep repeating “how can you not understand this?”

Then, I remembered the “So you want to be a wizard?” zine that Shane handed out and had us read, and its writer Julia Evans’s positive reframing of difficulty. In this programming zine, she presents bugs as learning opportunities. Bob Ross would have added – “happy accidents”. Somehow, crying after this “Introduction to Data” was a personal necessity. I needed to get my math trauma out of the way, and the deep feelings of shame, guilt, and incompetence that have been hindering me for years. I have no illusions: I know I won’t become Ada Lovelace, Elizebeth Smith Friedman or Mavis Batey – I will still be bad at math, because my brain must have wired itself differently. But now that we are being invited to learn, fail and learn from apparent failure, I know that I will hold my head high and try, fail, learn and try again, differently. Praxis has allowed me to move on and make peace with the teenager in me who still feels the burning shame of being the last at something. Now, I can tell her that a bad math average makes for the best potential for growth. 0.25/20 is not so bad.

Scraping a webpage’s list of linked files using wget

July 10, 2024, 12:00

I want to apply some text analysis tools to explore questions around a set of podcast interviews. There’s a webpage that lists links to transcripts of these interviews, one link per podcast episode text file. Because there are many episodes (over 100?), I don’t want to manually click each link to download the episode’s transcript file.

Instead, I followed a Programming Historian lesson by Ian Milligan about the command-line utility wget. The lesson helped me understand how to customize wget’s options so it downloads each transcript file for me into a convenient folder, without overloading the website’s servers.

Here’s the command I used (after installing a couple things that Milligan’s lesson walks you through):

wget -r -l 3 --random-wait -w 10 --limit-rate=20k [URL of folder containing desired files to download]

That command consists of the tool name (wget), a bunch of options modifying how the tool downloads files, and the URL you want to be downloading from. The options I chose to fit my particular webpage of interest:

-r tells wget to download recursively, following the links it finds on the page at the URL I provide

-l 3 limits that recursion to links at most 3 levels away from the initial URL I provided

-w 10 adds a 10 second wait between server requests

--random-wait was in response to my initial wget attempts producing an “ERROR 429: Too Many Requests.” message and not downloading files; it varies the wait time by 0.5 to 1.5 times the length provided with the -w 10 option above

--limit-rate=20k sets the maximum download speed to 20 KB/s to be nice to the site’s bandwidth (I initially tried -w 2; that allowed ~30 files to download before running into the 429 error again)
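To make the waiting arithmetic concrete, here is a small Python sketch of what -w 10 combined with --random-wait implies. It only restates the 0.5 to 1.5 multiplier described above; it does not call wget. The function names are hypothetical, the file count of 100 is a stand-in for the “over 100” episodes, and the total assumes that the pauses average out to the -w value.

```python
def random_wait_bounds(wait_seconds):
    """Return the (shortest, longest) pause implied by --random-wait for a -w value."""
    return 0.5 * wait_seconds, 1.5 * wait_seconds


def estimated_pause_total(file_count, wait_seconds):
    """Estimate total seconds spent pausing, assuming the average pause equals -w."""
    # One pause between each pair of consecutive requests.
    return max(file_count - 1, 0) * wait_seconds


low, high = random_wait_bounds(10)
print(low, high)  # 5.0 15.0

# Hypothetical count of 100 transcript files.
print(estimated_pause_total(100, 10))  # 990
```

At roughly 990 seconds, the pauses alone come to about 16 and a half minutes for 100 files. A recursive download also requests the linked pages along the way, so the real run will likely take longer.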

The files are now downloading in the background! My next step will be using another command-line utility, pandoc, to convert the transcript files from one file type (MS Word) to another file type friendlier to text analysis.

If you’re interested in automating downloads rather than manually clicking-saving a bunch, you might check out my post from 5(!) years ago on automating taking screenshots of webpages using a list of URLs (which I used to get a folder of screenshots of all my faved tweets).

If you’re interested but have never used the command line, Programming Historian has a peer-reviewed, free online tutorial aimed at humanities/cultural heritage folks who want to learn command line use. I highly recommend PH’s website; not only is every lesson created with communal care (author[s], editor, multiple reviewers), the lessons are aimed at humanities-ish folks (the things that might interest you, the things you might be excited to learn how to do with code), and the lessons are written in a very novice-friendly style (no assumptions you already know things; or the advanced lessons point you to the earlier lessons you’ll need to complete before you can comfortably follow them).
