What are the potential uses of artificial intelligence (AI) in helping to archive our digital past? That was the guiding question of our workshop, which took place online on 19 July 2021.
The workshop was organised as part of the Unlocking our Digital Past project which is a collaboration between Loughborough University and the Cabinet Office. The project is funded by the Enterprise Projects Group at Loughborough University.
The overall aim of the project is to better understand how AI can help improve the preservation, accessibility and usability of born-digital archives (files that are created digitally and remain in a digital format).
The project is acutely aware that so much of the content we create today in our professional and personal lives is born-digital. We create more digital files than we ever have. Just think about the number of emails you send and receive in a day.
There are teams of professionals whose job it is to ensure that, under the UK Public Records Act, each government department transfers records of historic value to The National Archives after twenty years. Before transfer, documents are kept within departments, which need to comply with statutory obligations (Freedom of Information requests, inquiries and the like). As the amount of digital material created grows exponentially, this task has become ever more difficult to manage.
The workshop – what was it all about?
The aim of our workshop was to bring together archivists, civil servants, and academic researchers (including computer scientists, historians, and digital humanities people) in a (virtual) room together to explore the potential uses of AI, and the implications of these uses for archivists and researchers.
We managed to secure some great speakers for the workshop, who helped inspire conversations and cross-sector thinking.
Here is a quick summary of what the speakers had to say:
Dr Jenny Bunn – Head of Archives Research at The National Archives
Jenny started us off by explaining that artificial intelligence isn’t a thing in and of itself, but rather an ‘intellectual puzzle launched in the 1950s’ that has resulted in a raft of techniques and technologies that have expanded the realm of what is possible. The vast number of digital records within organisations constitutes a ‘digital heap’. The work necessary to make that born-digital material accessible is different from that needed for paper records. Jenny said the focus should be to augment the expertise of archivists and offer machine-based perspectives to widen the ways in which access can be provided to digital archives.
Jenny set the day up perfectly, and Professor Jason R. Baron’s talk went on to focus on how one of the techniques developed under the umbrella of AI could be used by archivists working with large email collections.
Jason R Baron – Professor of Practice at the University of Maryland
Jason started his talk by explaining how NARA started accepting US Presidential emails in the early 1980s. The volumes of emails created by each presidency have continued to rise: President Reagan’s White House created around 500,000 emails; President Obama’s created 300 million. Under NARA’s Capstone policy, most federal government agencies have been selecting some email accounts for permanent preservation. The success of NARA’s attempts to select and preserve email accounts has not yet been matched by successful techniques to sensitivity review the collections, and here there is a clear potential role for AI techniques, possibly including techniques borrowed from the legal world in eDiscovery cases. Jason described a sandbox project he and colleagues had carried out to test the potential to train machine models to spot emails that were protected by legal privilege.
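To give a flavour of the kind of approach Jason described (the details of his project are not reproduced here, so this is a hypothetical sketch using invented emails and labels), a very simple bag-of-words classifier can be trained on examples marked ‘privileged’ or ‘not privileged’ and then asked to score new messages:

```python
# A toy Naive Bayes text classifier, sketching the idea of training a
# model to flag emails that may be covered by legal privilege.
# The training data and labels below are invented for illustration only.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (email_text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the most probable label, using add-one smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + sum of log likelihoods for each word
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data: counsel-related wording vs routine admin.
training = [
    ("attorney client advice on pending litigation", "privileged"),
    ("legal counsel memo re settlement strategy", "privileged"),
    ("lunch meeting moved to tuesday", "not privileged"),
    ("quarterly budget spreadsheet attached", "not privileged"),
]
wc, lc = train(training)
print(classify("advice from counsel on litigation risk", wc, lc))  # privileged
```

A real system would be trained on far larger labelled collections and would support a human reviewer rather than making final decisions itself, but the shape of the technique is the same.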
Dr Adam Nix – Lecturer in Responsible Business at the University of Birmingham
Adam spoke about email archives from the perspective of a researcher trying to access and use them. Adam explained how his project, Historicizing the dot.com bubble and contextualizing email archives, was exploring ways of using machine learning techniques for search and visualization of the content of email archives.
Dr Annalina Caputo – Assistant Professor in the School of Computing at Dublin City University
Annalina followed on from Adam by talking about some of the potential ways AI techniques might help make large text-based archives more accessible to researchers. Annalina focused on how machine learning can help track and understand the changing meaning of words over time, and better connect named people, places, events and organisations in archival collections through a technique called ‘named entity recognition’. Through tracking the changes in named entities, it may be easier to see where contexts have changed within digital archival collections.
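Production named entity recognition relies on trained statistical models, but the core idea — tagging spans of text as people, places or organisations — can be sketched with a simple gazetteer lookup. This is a toy illustration of the concept, not Annalina’s method; the lookup table and document below are invented:

```python
# A minimal gazetteer-based entity tagger, illustrating the idea behind
# named entity recognition. Real NER systems learn to recognise unseen
# names from context; this fixed lookup is for illustration only.
import re

# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {
    "Jenny Bunn": "PERSON",
    "The National Archives": "ORGANISATION",
    "London": "PLACE",
}

def tag_entities(text):
    """Return (entity, type, position) for each gazetteer match."""
    found = []
    for name, etype in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((name, etype, match.start()))
    return sorted(found, key=lambda t: t[2])

doc = "Jenny Bunn of The National Archives spoke about records held in London."
for entity, etype, pos in tag_entities(doc):
    print(f"{entity} -> {etype}")
```

Once entities are tagged consistently across a collection, their co-occurrences can be counted over time, which is what makes it possible to see contexts shifting in the way Annalina described.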
Dr Graham McDonald – Lecturer in Information Retrieval at the University of Glasgow
Graham focused on the potential of technology-assisted sensitivity review to assist human reviewers in the digital archiving process. Graham stressed the importance of a continually evolving, iterative system that learns about new and changing contexts from human experts. Graham’s doctoral project had focused on identifying sensitive information relating to international relations and personal information. He had sought to keep the model relatively simple to support explainability.
As well as this fantastic lineup of speakers, we had attendees contributing to group discussions from the Cabinet Office, the Foreign, Commonwealth and Development Office, National Library Wales / Llyfrgell Genedlaethol Cymru, Science Museum Group, The Alan Turing Institute, University of Bristol, National Archives and Records Administration (USA), Royal Danish Library, Riksarkivet / Swedish National Archives and many more.
What did the workshop tell us?
The workshop highlighted continued cross-sector, collaborative working as key to implementing any AI-driven approach to creating accessible and useful digital archives. Archivists, civil servants, computer scientists, and researchers of all types need to continue collaborating to ensure we better understand each other’s processes, responsibilities and needs.
The workshop also discussed the need for transparency: supporting understanding of what AI tools are doing and how they are doing it. This is not to say that we all need to become experts in coding and in statistical learning models. Rather it means that:
- there needs to be transparency with regard to the data AI tools have been trained on.
- the logic being used by the tools should be as explainable as possible.
The workshop was clear that AI does not replace human intelligence or expertise: AI tools augment it. For a number of years, a type of AI implementation called ‘human-in-the-loop’ has been used. This means that there is always a human interacting with the AI decision making, saying yes or no to its suggestions. However, Graham McDonald explained that we might reframe this as ‘computer-in-the-loop’, suggesting that the AI is still the tool that is being used to support human-driven processes – not the other way around.
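The ‘computer-in-the-loop’ framing can be sketched as a simple review cycle: the model proposes, the human decides, and disagreements flow back as new training signal. All the functions and records below are illustrative stand-ins, not any real system:

```python
# A schematic 'computer-in-the-loop' review cycle: the model suggests,
# the human decides, and confirmed corrections feed back as training
# signal. Every function and record here is an invented stand-in.

def model_suggests(record):
    # Stand-in for a trained classifier's suggestion.
    return "sensitive" if "personal" in record else "not sensitive"

def human_reviews(record, suggestion):
    # Stand-in for an archivist who checks the record directly and may
    # override the model (e.g. the model misses 'secret' material).
    if "secret" in record or "personal" in record:
        return "sensitive"
    return "not sensitive"

def review_loop(records):
    decisions, feedback = [], []
    for record in records:
        suggestion = model_suggests(record)
        decision = human_reviews(record, suggestion)
        decisions.append((record, decision))
        if decision != suggestion:
            # Disagreements become new training examples for the model.
            feedback.append((record, decision))
    return decisions, feedback

decisions, feedback = review_loop([
    "minutes of a routine meeting",
    "note containing personal data",
    "secret briefing paper",
])
print(len(feedback))  # prints 1: the record where the human overrode the model
```

The point of the reframing is visible in the structure: the human review function makes every final decision, and the model is only ever a source of suggestions and a recipient of corrections.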
The workshop was the first of two being planned as part of the project. Our collaboration with the Cabinet Office continues and we welcome contributions from more government professionals, archivists and academics who are interested in this project. The next workshop is scheduled for September, and we are hoping to build on the main themes that came out of the first workshop.
To keep up to date with the project or learn more about the presentations from the workshop – head to our blog page and keep an eye on the hashtag #OurDigitalPast on Twitter.