
Unlocking the past: GenAI in library archival research

By Daniela Garrido Fajardo

UVic Libraries pioneers AI research and technology in library archives, bringing old newspaper history back to life

In an age of constant misinformation, cautionary tales about the benefits and pitfalls of artificial intelligence (AI) ring loud throughout the silos of academia. But what if AI could help researchers uncover old archives once inaccessible, and help make library catalogues easier to navigate?

As part of the University of Victoria Libraries’ Kula: Library Futures Academy, an open-source retrieval-augmented generation (RAG) pipeline is being developed using historic newspapers held in the archives. The proposed RAG pipeline can serve as a template for other institutions looking to use large language models (LLMs) with digitized collections to expand knowledge in a practical, ethical and responsible way.

Digital Preservation Librarian Corey Davis and Kula Fellow Chloë Farr have been working with RAG and optical character recognition (OCR) technologies to better capture metadata from newspapers and to combine information retrieval with generative AI, producing accurate, context-aware responses grounded in authoritative source material.

Q: Tell us more about the “Unlocking the Past” project and how it came to be.

Davis: The main idea for the project was to augment LLMs with data specifically from the library’s collections, utilizing the power that these technologies have for semantic search, retrieval and to generate answers to make our collections more accessible.

LLMs work by sucking up as much data as they possibly can from the internet. The big limitation is that there’s no way for them to know what is going on in the world after their training is finished. When my research started, we were looking at something called retrieval-augmented generation, which is shortened to RAG. The information we’re looking at now is web-based, so we have tools to capture websites. However, this information can often be inaccessible, so we wanted to see if we could make it more accessible by using these RAG pipelines to create chatbots that overlay our web archives.

Since our initial RAG efforts were successful, we decided to look at other types of collections. We have the equivalent of 200 years of historic digitized newspapers from Victoria to analyze and archive. But before you can extract the data from these newspapers, you first have to turn them into a form that these large language models can understand. You have to get the text out of the scanned images.
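To make the idea concrete, the general RAG pattern Davis describes can be sketched in a few lines of Python. The document IDs, passages and function names below are placeholders, and the retrieval step is a deliberately simple term-overlap ranking rather than the semantic search used in the library’s actual pipeline; the final call to a language model is left as a comment.

```python
# Minimal sketch of the general RAG pattern: retrieve passages from a local
# collection, then ask a model to answer only from them. Placeholders only --
# this is not the UVic Libraries pipeline.
from collections import Counter

ARCHIVE = {
    "colonist-1871-05-02-p3": "Shipping news from the inner harbour ...",
    "colonist-1871-06-14-p1": "Report on the new telegraph line to Nanaimo ...",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank archive passages by simple term overlap with the query."""
    q_terms = Counter(query.lower().split())
    scored = []
    for doc_id, text in ARCHIVE.items():
        overlap = sum((Counter(text.lower().split()) & q_terms).values())
        scored.append((overlap, doc_id, text))
    scored.sort(reverse=True)
    return [f"[{doc_id}] {text}" for _, doc_id, text in scored[:k]]

def build_prompt(query: str) -> str:
    """Assemble a prompt that asks the LLM to answer only from retrieved sources."""
    context = "\n".join(retrieve(query))
    return (
        "Answer using only the sources below, and cite the source IDs.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

if __name__ == "__main__":
    # The prompt would then be sent to whichever LLM the institution has chosen.
    print(build_prompt("When did the telegraph reach Nanaimo?"))
```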

Q: What does the process of digitizing historical newspaper archives look like? Have there been any challenges in this process?

Farr: The technology I’m working with is OCR. Right now, if you go to the Internet Archive and look at any of the Times Colonist collections, what you’ll see is a picture of the scanned newspaper and a search box. If you type a word into the search box, you may get some results, but the likelihood of capturing every instance of it on the newspaper page is low.

For instance, if someone types in “Vancouver” and wants to find all instances of “Vancouver,” and it’s been hyphenated twice on the page, there are going to be at least two fewer results than there would be otherwise. Additionally, if you were to download the plain text files from these documents, which you can do by the thousands on the Internet Archive, you’ll see a lot of junk: special characters, half words and words that have been merged together. It’s just not consistent or reflective of the actual text on the page.
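Farr’s hyphenation example can be reproduced in a few lines of Python. The snippet of OCR output below is invented for illustration; it shows how a naive search undercounts a term that has been split across a line break, and how a simple normalization pass recovers the missing hit.

```python
# Illustrative sketch of the hyphenation problem, using a made-up snippet of
# OCR output. Rejoining words split across line breaks recovers matches that a
# naive search would miss.
import re

ocr_text = "Steamer arrives from Van-\ncouver. Freight bound for Vancouver Island."

naive_hits = len(re.findall(r"vancouver", ocr_text, flags=re.IGNORECASE))

# Rejoin words hyphenated across a line break ("Van-\ncouver" -> "Vancouver").
normalized = re.sub(r"-\s*\n\s*", "", ocr_text)
normalized_hits = len(re.findall(r"vancouver", normalized, flags=re.IGNORECASE))

print(naive_hits)       # 1 -- the hyphenated instance is missed
print(normalized_hits)  # 2 -- both instances found
```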

A language model can handle some bad text, but it cannot handle text that goes beyond simple spelling mistakes to the point where the majority of words become non-words. The results just won’t be good enough. History would be completely distorted if we used the existing text conversions that have been generated over the last 15 years.

There are two common OCR tools: ABBYY FineReader and Tesseract. They had been the industry standard for typed text until about a year ago. We are now looking into what are called vision language models, which combine computer vision, machine learning and language models to interpret textual images. They are a level up from OCR because the vision models understand the layout of the text and can see through flaws: they can see beyond blur, beyond creases in the page and skewed rotation, and they can interpret the newspaper’s layout. The model uses its interpretation of the document as a whole to reconcile flaws and provide the most likely text.
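For readers who want to try the conventional baseline Farr describes, the open-source Tesseract engine can be run through its Python wrapper, pytesseract. The sketch below assumes a scanned page image on disk (the filename is a placeholder) and uses word-level confidence scores to flag the kind of junk words older conversions are full of; it is a baseline OCR pass, not the vision-language-model approach the project is exploring.

```python
# Baseline OCR pass with Tesseract via pytesseract
# (pip install pytesseract pillow, plus the Tesseract binary itself).
from PIL import Image
import pytesseract

page = Image.open("colonist_1871_05_02_p3.png")  # placeholder scan

# Plain text extraction -- the step vision language models aim to improve on.
text = pytesseract.image_to_string(page)

# Word-level confidences help flag unreliable words in the conversion.
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
low_confidence = [
    word for word, conf in zip(data["text"], data["conf"])
    if word.strip() and float(conf) < 60
]

print(text[:500])
print("Words Tesseract is unsure about:", low_confidence[:20])
```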

Challenge-wise, I think it’s getting an appropriate data set to test on. I have to be mindful of what I’m using and where it comes from, because the work that I’m doing is mostly on archival material, which is subject to the licenses and agreements made with the donors of that material. For me, that’s the biggest challenge: just making sure that the distribution of testing data is accurate and appropriate for what I’m trying to glean.

Q: What positive changes do you foresee coming from this project? What are the ethical considerations?

Davis: There are all these positive things where it opens up the historical record in ways that are much more robust. For local genealogists, historians or researchers who are looking into things, suddenly they have the ability to interact with this information in ways that they couldn’t have thought of before.

There’s been this enforced obscurity to the material because before these vision language models and large language models came out, the technologies used to extract information weren’t as effective—it just sort of gave you a hint of what was in there. In many cases what you had to do was open the PDF in your browser and look at it almost like you would an old microfiche, you know, in the old movies where someone’s scanning through and looking for that stuff. That has really positive effects in terms of access.

However, there are also a lot of ethical issues with resurfacing materials, like a letter to the editor or personal correspondence that might contain information that would impact people today. If you were to look at how the Times Colonist or the Daily Colonist dealt with, for example, Indigenous topics in the 1870s, you’d get a very different view of things. And so, when you start to surface those things, how do you provide the context around them so that people understand that this is a historical artifact, without amplifying, say, the racism, the sexism and other problematic content?

There’s still this issue of hallucinations, where LLMs don’t understand the world; they’re just statistically predicting what the next word in a sentence might be. If these hallucinations occur when AI is giving us information within an academic environment, where trust, truth and ground truth (i.e., accurate, real-world data used as a benchmark to train and evaluate models) are so critically important, we need to be careful to make sure these systems are working and that the appropriate guardrails are in place.
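One modest guardrail of the kind Davis mentions can be sketched in plain Python: checking that every source an answer cites actually appears in the retrieved set. The IDs and answers below are invented for illustration, and a production system would layer many more checks on top of this.

```python
# Simple grounding check: reject any generated answer that cites a source ID
# not present in the retrieved set. IDs and answers are placeholders.
import re

retrieved_ids = {"colonist-1871-05-02-p3", "colonist-1871-06-14-p1"}

def cited_ids(answer: str) -> set[str]:
    """Pull bracketed source IDs like [colonist-1871-06-14-p1] out of an answer."""
    return set(re.findall(r"\[([^\]]+)\]", answer))

def grounded(answer: str) -> bool:
    """An answer passes only if it cites at least one source and every citation is retrieved."""
    cites = cited_ids(answer)
    return bool(cites) and cites <= retrieved_ids

print(grounded("The line reached Nanaimo in 1871 [colonist-1871-06-14-p1]."))  # True
print(grounded("The line reached Nanaimo in 1859 [colonist-1859-01-01-p1]."))  # False
```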

The final thing I’ll mention is that, for the creators and publishers of this data, it’s still unclear how copyright applies. Can we use this material to build these things? Is it within the bounds of what we’re allowed to work with? There’s also increasing polarization around AI in society, often without an understanding of the roots of the technology and its impacts. These are the kinds of things that we’re juggling right now.

Q: How can other institutions looking to responsibly use LLMs with digitized collections implement similar projects?

Farr: Everyone is kind of moving together right now, so I wouldn’t say we have universal, concrete guidelines that we can follow, especially when it comes to my side of the project, the OCR work. Between researchers, at this point, the footing is quite even. The institutions that are working on this are doing so simultaneously and are sometimes working through these questions together. That would be my recommendation: don’t do this in isolation. There are a lot of questions, institutional, ethical and legal, and trying to answer them alone is going to make anyone lose the plot on the work that’s being done.

Davis: We hope to move the world forward.

Farr: We’re spending more time pausing on questions and thinking about the answers than working on the technical solutions. We’re prioritizing the benefit over the innovation, and we’re very open to dialogue on this topic.
