Textcavator renewed: new name, new upload feature, and new corpora
I-Analyzer is now called Textcavator, a name that better reflects the flagship tool of the CDH Research Software Lab (RSLab). In addition to the new name, the RSLab is introducing an upload feature for adding your own dataset, as well as several new corpora. The Centre for Digital Humanities spoke with scientific developer Luka van der Plas about the updates.
Why the name change from I-Analyzer to Textcavator?

‘Textcavator better reflects what the tool actually does than the name I-Analyzer,’ says Van der Plas. ‘It is primarily designed for exploring texts, rather than for conducting in-depth analysis. Excavator literally means a digging machine, but it is also used figuratively to mean digging into something. And that is exactly what the tool does: it retrieves information from texts.’
What can you do with Textcavator?
‘At its core, it is a search engine: you can search a dataset using keywords that are relevant to your research. It is a comprehensive tool for finding what you are looking for. That is why we refer to it as a text search and exploration tool, rather than a text-mining tool. Afterwards, you can carry out more extensive analyses yourself—qualitative or quantitative—using the search results you download from Textcavator.’
‘The analysis tools we offer within Textcavator—simple statistics and basic visualisations—are intended to help you search as effectively as possible, not to conduct your actual research. You can filter by time period or category and bookmark documents. We also provide visualisations and statistics to help refine search queries. These show, for example, how a search term is distributed across categories or time periods, or which words frequently appear in its context. Depending on the dataset, we also offer more advanced features, such as Word Embeddings and Named Entity Recognition.’
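As an illustration of the kind of follow-up analysis mentioned above, the sketch below counts how many downloaded search results fall in each year. It is a minimal example under stated assumptions, not a description of Textcavator's export format: it assumes the results were saved as a CSV file with a column named `date` holding ISO-style dates (YYYY-MM-DD). The function name `hits_per_year`, the column names and the sample rows are all hypothetical.

```python
import csv
import io
from collections import Counter


def hits_per_year(csv_file, date_column="date"):
    """Count rows per year in a CSV of search results.

    Assumption: the date column holds ISO-style dates (YYYY-MM-DD),
    so the year is the first four characters. Rows without a usable
    year are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        value = (row.get(date_column) or "").strip()
        year = value[:4]
        if len(year) == 4 and year.isdigit():
            counts[year] += 1
    # Return the counts ordered by year for easy reading or plotting.
    return dict(sorted(counts.items()))


# Made-up sample standing in for a downloaded results file.
sample = io.StringIO(
    "date,title\n"
    "1851-05-01,Opening of the exhibition\n"
    "1851-10-15,Closing day\n"
    "1852-01-03,New year\n"
    ",Undated item\n"
)
print(hits_per_year(sample))
```

For a real download, the same function can be called on an opened file, for example `with open("results.csv", newline="", encoding="utf-8") as f: hits_per_year(f)`, after adjusting `date_column` to match the actual column name in the file.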

Many text exploration tools already exist. Why did you choose to develop your own?
‘The RSLab began developing what was then called I-Analyzer in 2017. This allowed us to tell researchers: we already have a working tool. We only need to load your dataset into it and perhaps add a button or two. That way, we can support even small projects with limited funding, which I find very rewarding.’
‘Working open source is also important to us. And because we develop the tool within the university, we are not driven by profit: we are truly here for the researcher. We work closely with researchers to develop Textcavator, although external users can also use it. We wanted a tool that is not overly technical, can accommodate many different types of datasets, and is suitable for all disciplines within the humanities.’
Are all datasets in Textcavator public?
‘No. We prefer to make data public, but that is not always possible. Cultural data is often protected by copyright. That is why, when uploading, you can decide who gets access: everyone, only the university, a specific (research) group, or just yourself.’
Which new corpora have been added?
‘In collaboration with the University Library, we have added several new corpora from the publisher Gale, including nineteenth-century British and American newspapers and magazines such as Punch and Illustrated London News. These are great additions to the newspaper corpora we already offer, such as The Guardian and The Times.’
How do you decide which corpora to add?
‘Many researchers bring their own data—collected or cleaned for their research. In addition to joint acquisitions with the University Library, we also occasionally add public corpora for which there is wide demand, such as the KB newspaper corpus, DBNL, Gallica and Le Figaro.’
Which research projects have used Textcavator?
‘The largest is People & Parliament, a leading project in political history, conducted with the University of Jyväskylä (Finland). They needed an efficient tool to search a vast collection of parliamentary debates from across Europe.’
‘Another example is Traces of Sound, a much smaller project. For this, we built a proof of concept in Textcavator using a small set of sources and annotations related to references to sound. This helped the researcher submit a larger grant proposal.’
How accessible is Textcavator for beginners?
‘We specifically focus on researchers with little experience in text and data mining. The tool is designed to be as user-friendly as possible. For those who want more, additional features are available. But even more advanced features, such as Named Entity Recognition, can be used without extensive technical knowledge.’
You are working on an upload feature. What does it entail?
‘The new upload feature allows researchers to add their own datasets directly to Textcavator. Currently, this is always done by us, which makes researchers dependent on our available time. We are now in the final development phase and are therefore organising a pilot to test the feature together with the research community.’
What do you hope to learn from this pilot?
‘One goal is to identify any bottlenecks. Textcavator is designed for highly diverse data, which can also complicate things. We want to ensure everything works smoothly and is clear to use before opening the feature to everyone. During the pilot we will receive feedback and can step in immediately if something is unclear or not yet working properly.’
‘We also think it is important that the feature truly aligns with researchers’ needs. For example: how much should be filled in automatically, and how much should users be able to configure themselves? In which file formats would they like to upload their data? Instead of speculating about this behind closed doors, we want to ask users directly.’
Who can participate in the pilot?
‘We are looking for a broad group of researchers. Anyone with data they would like to add to Textcavator can take part. This may be their own research data, but also an open access dataset. A small Excel file with a hundred documents is just as welcome as a large dataset. The only requirement is that you can clean or format the data yourself, if needed.’
Which features could be added if there is demand?
‘In the short term, we aim to make the process user-friendly and accessible, focusing on small adjustments such as additional guidance and feedback. In the long term, we are considering larger expansions, such as more file formats, or even manual data entry.’
‘There are also features already in Textcavator that are not yet offered through the form, such as adding images or word embeddings. These could be valuable additions, but they also make the upload process more complex for researchers.’
What have you learned from developing a tool for so many disciplines?
‘The biggest challenge is maintaining clarity for the user. We continue to add new features, but we want to prevent the interface from becoming overwhelming. It is a constant balance between accessibility and technical possibilities.’
‘And what strikes me is how similar the needs of researchers in the humanities and social sciences actually are. You might expect them to require very different tools, but in practice that is not the case.’
Currently, scientific developers Luka van der Plas, Jelte van Boheemen, Mees van Stiphout and Ben Bonfil are working on Textcavator alongside their other projects.
Read more about Textcavator here.
Read more and sign up for the pilot here.
The post Textcavator renewed: new name, new upload feature, and new corpora appeared first on Centre for Digital Humanities.