
Interesting digital humanities data sources

I bookmark sources of data that seem interesting for digital humanities teaching and research:

  • showing humanists what data & datafication in their fields can look like
  • having interesting examples when teaching data-using tools
  • trying out new data tools

I’m focusing on sharing bookmarks with data that’s already in spreadsheet or similar structured format, rather than e.g.

  • collections of digitized paper media (which also count as data and are worth exploring), like Josh Begley’s racebox.org, which links to full PDFs of US Census surveys on race and ethnicity over the years; or
  • 3D data, like my colleague Will Rourk’s on historic architecture and artifacts, including a local Rosenwald School and at-risk former dwellings of enslaved people

Don’t forget to cite datasets you use (e.g. build on, are influenced by, etc.)!

And if you’re looking for community, the Journal of Open Humanities Data is celebrating its 10th anniversary with a free, global virtual event on 9/26 including “lightning talks, thematic dialogues, and community discussions on the future of open humanities data”.

Data is being destroyed

U.S. fascists have destroyed or put barriers around a significant amount of public data in just the last 8 months. Check out Laura Guertin’s “Data, Interrupted” quilt blog post, then the free DIY Web Archiving zine by me, Quinn Dombrowski, Tessa Walsh, Anna Kijas, and Ilya Kreymer for a novice-friendly guide to helping preserve the pieces of the Web you care about (and why you should do it rather than assuming someone else will). The Data Rescue Project is a collaborative effort meant “to serve as a clearinghouse for data rescue-related efforts and data access points for public US governmental data that are currently at risk. We want to know what is happening in the community so that we can coordinate focus. Efforts include: data gathering, data curation and cleaning, data cataloging, and providing sustained access and distribution of data assets.”

Interesting datasets

The Database of African American and Predominantly White American Literature Anthologies

By Amy Earhart

“Created to test how we categorize identities represented in generalist literature anthologies in a database and to analyze the canon of both areas of literary study. The dataset creation informs the monograph Digital Literary Redlining: African American Anthologies, Digital Humanities, and the Canon (Earhart 2025). It is a highly curated small data project that includes 267 individual anthology volumes, 107 editions, 319 editors, 2,844 unique individual authors, and 22,392 individual entries, and allows the user to track the shifting inclusion and exclusion of authors over more than a hundred-year period. Focusing on author inclusion, the data includes gender and race designations of authors and editors.”

National UFO Reporting Center: “Tier 1” sighting reports

Via Ronda Grizzle, who uses this dataset when teaching Scholars’ Lab graduate Praxis Fellows how to shape research questions that match available data, and how to understand datasets as subjective and choice-based. I know UFOs sound like a funny topic, and they can be, but there are also lots of interesting inroads, like how the language people use reflects hopes, fears, imagination, otherness, and certainty. It’s a good teaching dataset: there aren’t overly many fields per report, and those include mappable, timeline-able, and narrative-text fields, plus a very subjective, interesting one (a taxonomy of UFO shapes). nuforc.org/subndx/?id=highlights

The Pudding

Well researched, contextualized, beautifully designed data storytelling on fun or meaningful questions, with an emphasis on cultural data and how to tell stories with data (including personally motivated ones, something that I think is both inspiring for students and great to have examples of how to do critically). pudding.cool

…and its Ham4Corpus use

Shirley Wu’s interactive visualization for The Pudding of every line in Hamilton uses my ham4corpus dataset (and data from other sources), which might be a useful example of how an afternoon’s work with open-access data (Wikipedia, lyrics) and some simple scripted data cleaning and formatting can produce foundations for research and visualization.
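That kind of afternoon pipeline is easy to sketch. Below is a minimal, hypothetical example (not the actual ham4corpus scripts; the “SPEAKER: line” input format is invented for illustration): it splits raw text into structured per-speaker records, then serializes them as CSV.

```python
import csv
import io
import re

def lines_to_records(raw_text):
    """Split raw "SPEAKER: lyric" text into (speaker, line) records.

    Assumes one utterance per line, with an all-caps speaker name
    before a colon -- an invented format, not ham4corpus's actual layout.
    """
    records = []
    speaker = None
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = re.match(r"^([A-Z][A-Z .'/]*):\s*(.+)$", line)
        if match:
            speaker, text = match.groups()
        else:
            text = line  # continuation line: attribute to the last speaker
        if speaker:
            records.append((speaker, text))
    return records

def records_to_csv(records):
    """Serialize (speaker, line) records to CSV with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["speaker", "line"])
    writer.writerows(records)
    return buf.getvalue()

sample = """NARRATOR: The first line of the scene
which carries over to a second line
CHORUS: A different speaker's line"""
records = lines_to_records(sample)
# records[0] == ("NARRATOR", "The first line of the scene")
```

From records like these, counting lines per speaker or words per line is a one-liner, which is roughly the level of effort that separates “a pile of lyrics” from “a dataset someone can visualize.”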

Responsible Datasets in Context

Dirs. Sylvia Fernandez, Miriam Posner, Anna Preus, Amardeep Singh, & Melanie Walsh

“Understanding the social and historical context of data is essential for all responsible data work. We host datasets that are paired with rich documentation, data essays, and teaching resources, all of which draw on context and humanities perspectives and methods. We provide models for responsible data curation, documentation, story-telling, and analysis.” Four rich dataset options (as of August 2025), each including a data essay, the ability to explore the data on the site, and programming and discussion exercises for investigating and understanding the data. Datasets: US national park visit data, gender violence at the border, ~1,000 early 20th-century poems from African American periodicals, and the top 500 “greatest” novels according to OCLC records on the novels most held by libraries. responsible-datasets-in-context.com

Post45 Data Collective

Eds Melanie Walsh, Alexander Manshel, J.D. Porter

“A peer-reviewed, open-access repository for literary and cultural data from 1945 to the present”, offering 11 datasets (as of August 2025) useful in investigations such as how book popularity & literary canons get manufactured. Includes datasets on “The Canon of Asian American Literature”, “International Bestsellers”, “Time Horizons of Futuristic Fiction”, and “The Index of Major Literary Prizes in the US”. The project ‘provides an open-access home for humanities data, peer reviews data so scholars can gain institutional recognition, and DOIs so this work can be cited’: data.post45.org/our-data.html

CBP and ICE databases

Via Miriam Posner: A spreadsheet containing all publicly available information about CBP and ICE databases, from the American Immigration Council americanimmigrationcouncil.org/content-understanding-immigration-enforcement-databases

Data assignment in The Critical Fan Toolkit

By Cara Marta Messina

Messina’s project (which prioritizes ethical critical studies of fan works and fandom) includes this model teaching assignment on gathering and analyzing fandom data, and understanding the politics of what is represented by this data. Includes links to 2 data sources, as well as Destination Toast’s “How do I find/gather data about the ships in my fandom on AO3?”.

(Re:fan studies, note that there is/was an Archive of Our Own dataset—but it was created in a manner seen as invasive and unethical by AO3 writers and readers. Good to read about and discuss with students, but I do not recommend using it as a data source for those reasons.)

Fashion Calendar data

By Fashion Institute of Technology

Fashion Calendar was “an independent, weekly periodical that served as the official scheduling clearinghouse for the American fashion industry” from 1941 to 2014; Fashion International (1972–2008) and Home Furnishings (1947–1951) are also included in the dataset. Allows manipulation on the site (including graphing and mapping) as well as download as JSON. fashioncalendar.fitnyc.edu/page/data

Black Studies Dataverse

With datasets by Kenton Ramsby et al.

Found via Kaylen Dwyer. “The Black Studies Dataverse contains various quantitative and qualitative datasets related to the study of African American life and history that can be used in Digital Humanities research and teaching. Black studies is a systematic way of studying black people in the world – such as their history, culture, sociology, and religion. Users can access the information to perform analyses of various subjects ranging from literature, black migration patterns, and rap music. In addition, these .csv datasets can also be transformed into interactive infographics that tell stories about various topics in Black Studies.” dataverse.tdl.org/dataverse/uta-blackstudies

Netflix Movies & Shows

kaggle.com/datasets/shivamb/netflix-shows

Billboard Hot 100 Number Ones Database

By Chris Dalla Riva

Via Alex Selby-Boothroyd: a Google Sheet by Chris Dalla Riva with 100+ data fields for every US Billboard Hot 100 number-one song since August 4, 1958.

Internet Broadway Database

Found via Heather Froehlich: “provides data, publishes charts and structured tables of weekly attendance and ticket revenue, additionally available for individual shows”. ibdb.com

Structured Wikipedia Dataset

Wikimedia released this dataset sourced from their “Snapshot API which delivers bulk database dumps, aka snapshots, of Wikimedia projects—in this case, Wikipedia in English and French languages”. “Contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files using a consistent schema compressed as zip”: huggingface.co/datasets/wikimedia/structured-wikipedia. Do note there has been controversy in the past around Hugging Face scraping material for AI/dataset use without author permission, and differing understandings of how work published in various ways on the web is owned. (I might have a less passive description of this if I went and reminded myself what happened, but I’m not going to do that right now.)

CORGIS: The Collection of Really Great, Interesting, Situated Datasets project

By Austin Cory Bart, Dennis Kafura, Clifford A. Shaffer, Javier Tibau, Luke Gusukuma, Eli Tilevich

A visualizer and exportable versions of many interesting datasets on all kinds of topics.

FiveThirtyEight’s data

I’m not a fan for various reasons, but their data underlying various political, sports, and other stats-related articles might still be useful: data.fivethirtyeight.com. Or look at how and what they collect and include in their data, and what subjective choices and biases those reveal :)

Zine Bakery zines

I maintain a database of info on hundreds of zines related to social justice, culture, and/or tech topics for my ZineBakery.com project—with over 60 metadata fields (slightly fewer for the public view) capturing descriptive and evaluative details about each zine. Use the … icon then “export as CSV” to use the dataset (I haven’t tried this yet, so let me know if you encounter issues).

OpenAlex

I don’t know much about this yet, but it looked cool and is from a non-profit that builds tools to help with the journal racket (Unsub for understanding the value of “big deals” and their alternatives, Unpaywall for finding OA articles). “We index over 250M scholarly works from 250k sources, with extra coverage of humanities, non-English languages, and the Global South. We link these works to 90M disambiguated authors and 100k institutions, as well as enriching them with topic information, SDGs, citation counts, and much more. Export all your search results for free. For more flexibility use our API or even download the whole dataset. It’s all CC0-licensed so you can share and reuse it as you like!” openalex.org
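For a first poke at the API, something like the sketch below may work. It only builds the query URL; the parameter names (search, filter, per-page, mailto) are from my reading of the OpenAlex docs, so verify them before relying on this. You would then fetch the URL with e.g. requests.get(url).json().

```python
from urllib.parse import urlencode

OPENALEX_BASE = "https://api.openalex.org/works"

def build_works_query(search=None, filters=None, per_page=25, mailto=None):
    """Build an OpenAlex /works query URL.

    Parameter names follow my reading of the OpenAlex API docs;
    double-check them against the current documentation.
    """
    params = {}
    if search:
        params["search"] = search
    if filters:
        # Filters are comma-separated key:value pairs,
        # e.g. {"is_oa": "true"} -> "is_oa:true"
        params["filter"] = ",".join(f"{k}:{v}" for k, v in filters.items())
    params["per-page"] = per_page
    if mailto:
        params["mailto"] = mailto  # contact address for the "polite pool"
    return f"{OPENALEX_BASE}?{urlencode(params)}"

url = build_works_query(search="digital humanities",
                        filters={"is_oa": "true"},
                        per_page=5)
```

Keeping the URL-building separate from the network call makes it easy to eyeball (or test) the query before you start hitting the API.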

Bonus data tools, tutorials

Matt Lincoln’s salty: “When teaching students how to clean data, it helps to have data that isn’t too clean already. salty offers functions for ‘salting’ clean data with problems often found in datasets in the wild, such as pseudo-OCR errors, inconsistent capitalization and spelling, invalid dates, unpredictable punctuation in numeric fields, and missing values or empty strings”.
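salty itself is an R package, but the idea ports anywhere. Here is a rough Python analogue (my own sketch, not salty’s API) that “salts” clean strings with pseudo-OCR substitutions and inconsistent capitalization:

```python
import random

# A few classic OCR confusions; salty's own substitution tables
# are more extensive than this illustrative list.
OCR_SWAPS = [("l", "1"), ("O", "0"), ("rn", "m"), ("e", "c")]

def salt_ocr(text, rate=0.2, seed=None):
    """Apply each OCR-style substitution with probability `rate`
    (to one occurrence each, so errors stay sparse)."""
    rng = random.Random(seed)
    for clean, dirty in OCR_SWAPS:
        if clean in text and rng.random() < rate:
            text = text.replace(clean, dirty, 1)
    return text

def salt_capitalization(text, rate=0.3, seed=None):
    """Randomly flip the case of individual letters."""
    rng = random.Random(seed)
    return "".join(
        ch.swapcase() if ch.isalpha() and rng.random() < rate else ch
        for ch in text
    )

# With rate=1.0 the salting is deterministic:
# salt_ocr("library", rate=1.0) -> "1ibrary"
```

Pass a seed when you want reproducibly messy data for an assignment, so every student wrestles with the same errors.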

The Data-Sitters Club for smart, accessible, fun tutorials and essays on computational text analysis for digital humanities.

Claudia Berger’s blog post on designing a data physicalization—a data quilt!—as well as the final quilt and free research zine exploring the data, its physicalization process, and its provocations.

The Pudding’s resources for learning & doing data journalism and research

See also The Critical Fan Toolkit by Cara Marta Messina (discussed in datasets section above), which offers both tools and links to interesting datasets.

Letterpress data, not publicly available yet…

I maintain a database of the letterpress type, graphic blocks/cuts, presses, supplies, and books related to book arts owned by me or by Scholars’ Lab. I have a very-in-progress website version I’m slowly building, without easily downloadable data, just a table view of some of the fields.

I also have a slice of this viewable online and not as downloadable data: just a gallery of the queerer letterpress graphic blocks I’ve collected or created. But I could get more online if anyone was interested in teaching or otherwise working with it?

I also am nearly done developing a database of the former VA Center for the Book: Book Arts Program’s enormous collection of type, which includes top-down photos of each case of type. I’m hoping to add more photos of example prints that use each type, too. If this is of interest to your teaching or research, let me know, as external interest might motivate me to get to the point of publishing sooner.


Designing a Data Physicalization: A love letter to dot grid paper

Claudia Berger is our Virtual Artist-in-Residence 2024-2025; register for their April 15th virtual talk and a local viewing of their data quilt in the Scholars’ Lab Common Room!

This year I am the Scholars’ Lab’s Virtual Artist-in-Residence, and I’m working on a data quilt about the Appalachian Trail. I spent most of last semester doing the background research for the quilt, and this semester I get to actually start working on the quilt itself! Was this the best division of the project? Maybe not. But it is what I could do, and I am doing everything I can to get my quilt to the Lab by the event in April. I do work best with a deadline, so let’s see how it goes. I will be documenting the major steps in this project here on the blog.

Data or Design first?

This is often my biggest question: where do I even start? I can’t start the design until I know what data I have. But I also don’t know how much data I need until I do the design. It is really easy to get trapped at this stage, which may be why I didn’t start actively working on this part of the project until January. It can be daunting.

N.B. For some making projects this may not apply because the project might be about a particular dataset or a particular design. I started with a question though, and needed to figure out both.

However, like many things in life, it is a false binary. You don’t have to fully settle one before tackling the other, go figure. I came up with a design concept: a quilt made up of nine equally sized blocks in a 3x3 grid. Then I just needed to find enough data to fill nine visualizations. I made a list of the major themes I was drawn to in my research and went about finding some data that could fall into these categories.

A hand-written list about a box divided into nine squares, with the following text: AT Block Ideas: demographics, % land by state, Emma Gatewood, # miles, press coverage, harassment, Shenandoh, displacements, visit data, Tribal/Indig data, # of tribes, rights movements, plants on trail, black thru-hikers
What my initial planning looks like.

But what about the narrative?

So I got some data. It wasn’t necessarily a dataset for each of the nine quilt blocks, but it was enough to get started. I figured I could start on the design and then see how much more I needed, especially since some of my themes were hard to quantify in data. But as I started thinking about the layout of the quilt itself, I realized I didn’t know how I wanted people to “read” the quilt.

Would it be left to right and top down like how we read text (in English)?

A box divided into 9 squares numbered from left to right and top to bottom:  
1, 2, 3  
4, 5, 6  
7, 8, 9

Or in a more boustrophedon style, like how a river flows in a continuous line?

A box divided into 9 squares numbered from left to right and top to bottom: 1, 2, 3; 6, 5, 4; 7, 8, 9

Or should I make it so it can be read in any order and so the narrative makes sense with all of its surrounding blocks? But that would make it hard to have a companion zine that was similarly free-flowing.

So instead, I started to think more about quilts and ways narrative could lend itself to some traditional layouts. I played with the idea of making a large log cabin quilt. Log cabin patterns create a sort of spiral: they are built starting from the center, with pieces added around the outside. This is a pattern I’ve used in knitting and sewing before, but not in data physicalizations.

A log cabin quilt plan, where each additional piece builds off of the previous one.
A template for making a log cabin quilt block by Nido Quilters

What I liked most about this idea is it has a set starting point in the center, and as the blocks continue around the spiral they get larger. Narratively this let me start with a simpler “seed” of the topic and keep expanding to more nuanced visualizations that needed more space to be fully realized. The narrative gets to build in a more natural way.

A plan for log cabin quilt. The center is labeled 1, the next piece (2) is below it, 3 is to the right of it, 4 is on the top, and 5 is on the side. Each piece is double the size of the previous one (except 2, which is the same size as 1).

So while I had spent time fretting about starting with either data/the design of the visualizations, what I really needed to think through first was what is the story I am trying to tell? And how can I make the affordances of quilt design work with my narrative goals?

I make data physicalizations because they prioritize narrative and interpretation more than the “truth” of the data, and I had lost that as I got bogged down in the details. For me, narrative is first, and I use the data and the design to support the narrative.

Time to sketch it out

This is my absolute favorite part of the whole process. I get to play with dot grid paper and all my markers, what’s not to love? Granted, I am a stationery addict at heart. So I really do look for any excuse to use all of the fun materials I have. But this is the step where I feel like I get to “play” the most. While I love sewing, once I get there I already have the design pretty settled. I am mostly following my own instructions. This is where I get to make decisions and be creative with how I approach the visualizations.

(I really find dot grid paper to be the best material to use at this stage. It gives you a structure to work with that ensures things are even, but it isn’t as dominating on a page as full grid paper. Of course, this is just my opinion, and I love nothing more than doodling geometric patterns on dot grid paper. But using it really helps me translate dimensions to fabric, and I can do my “measuring” here. For this project I am envisioning a quilt 3 feet square. The inner block, Block 1, is 12 x 12 inches, so each grid square represents 3 inches.)

There is no one set way to approach this; this is just documentation of how I like to do it. If this doesn’t resonate with how you like to think about your projects, that is fine! Do it your own way. But I design the way I write, which is to say extremely linearly. I am not someone who can write by jumping around a document. I like to know the flow, so I start at the beginning and work my way to the end.

Ultimately, for quilt design, my process looks like this:

  1. Pick the block I am working on
  2. Pick which of the data I have gathered is a good fit for the topic
  3. Think about what is the most interesting part of the data, if I could only say one thing what would that be?
  4. Are there any quilting techniques that would lend themselves to the nature of the data or the topic? For example: appliqué, English paper piecing, half square triangles, or traditional quilt block designs, etc.
  5. Once I have the primary point designed, are there other parts of the data that work well narratively? And is there a design way to layer it?

For example, take this block on the demographics of people who complete thru-hikes of the trail, using annual surveys since 2016. (Since they didn’t run the survey in 2020 - and it was the center of the grid - I made that one an average of all of the reported years, using a different color to differentiate it.)

I used the idea of the nine-patch block as my starting point, although I adapted it to a base grid of 16 (4x4) patches to better fit the dimensions of the visualization. I used the nine-patch idea to show the gender split (white being men and green being all other answers, such as women, nonbinary, etc.). If it were a 50-50 split, 8 of the patches in each grid would be white, but that is never the case. I liked using the grid because it is easy to count the patches in each one, and by trying to make symmetrical or repetitive designs it is more obvious where the split isn’t balanced.

A box divided into 9 squares, with each square having its own green and white checkered pattern using the dot grid of the paper as a guide. The center square is brown and white. On top of each square is a series of horizontal or vertical lines ranging from four to nine lines.

But I also wanted to include the data on the reported race of thru-hikers. The challenge here is that it is on a completely different scale: while the gender split on average is 60-40, the average percentage of non-white hikers is 6.26%. In order not to confuse the two, I decided to use a different technique to display the data, relying on stitching instead of fabric. I felt this let me use two different scales at the same time that are related but distinct. I could still play with the grid to make it easy to count, and used one full line of stitching to represent 1%. Then I could easily round the data to the nearest 0.25% using the grid as a guide. So the more lines in each section, the more non-white thru-hikers there were.
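The two encodings come down to simple arithmetic: whole patches out of the 4x4 grid for the gender split, and stitch lines at 1% per full line, rounded to the nearest 0.25%. A minimal sketch (the function names are mine, and the numbers are the averages mentioned above, not per-year survey values):

```python
def gender_patches(pct_men, total_patches=16):
    """Split a 4x4 grid into white (men) and green (all other
    answers) patches, rounding to whole patches."""
    white = round(total_patches * pct_men / 100)
    return white, total_patches - white

def stitch_lines(pct_nonwhite):
    """One full stitch line per 1%, rounded to the nearest 0.25%."""
    return round(pct_nonwhite * 4) / 4

# gender_patches(60)  -> (10, 6): a 60-40 split is visibly off-balance
# gender_patches(50)  -> (8, 8): the never-observed even split
# stitch_lines(6.26)  -> 6.25 lines of stitching
```

Working the encoding out as arithmetic first makes the fabric step mechanical: each year’s survey numbers translate directly into patch counts and stitch lines.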

My last step, once I have completed a draft of the design, is to ask myself, “is this too chart-y?” It is really hard sometimes to avoid the temptation to essentially make a bar chart in fabric, so I like to challenge myself to see if there is a way I can move away from more traditional chart styles. Now, one of my blocks is essentially a bar chart, but since it was the only one and it really successfully highlighted the point I was making I decided to keep it.

A collection of designs using the log cabin layout, made with a collection of muted highlighters, with some pencil annotations next to the sketches. These are not the final colors that I will be using; they will probably all be changed once I dye the fabric and know what I am working with.

Next steps

Now, the design isn’t final. Choosing colors is a big part of the look of the quilt, so my next step is dyeing my fabric! I am hoping to have a blog post about the process of dyeing raw silk with plant-based dyes by the end of February. (I need deadlines; this will force me to get that done…) Once I have all of those colors I can return to the design and decide which colors will go where. More on that later. In the meantime, let me know if you have any questions about this process! Happy to do a follow-up post as needed.
