The Unlocking our Digital Past project seeks to understand the implications of using AI on digital archival material from multiple perspectives. The first Unlocking our Digital Past workshop, held on 19 July, explored some of the potential uses of artificial intelligence (AI) for digital archives. This second workshop aimed to pinpoint technologies that could support specific tasks. The first session focused on selection, appraisal and review processes; the second on the future of access to digital archives.
Two major themes emerged from the talks:
- There is no single AI-driven tool that will make digital archive materials more preservable, accessible and usable. We need to be thinking about a suite of tools that complete specific tasks that can be tailored for individual archival needs.
- There is a need to balance the responsibility of records managers and archivists for respecting confidentiality, privacy and intellectual property rights with the responsibility for making public records and archival collections accessible.
A summary of talks is given below. Recordings of the presentations are available on the Unlocking our Digital Past website.
Professor Cal Lee – School of Information and Library Science at the University of North Carolina, Chapel Hill.
Cal started the workshop with a talk on using Natural Language Processing (NLP) and Machine Learning (ML) in archival selection and appraisal workflows. He explained how digital material exists at different levels of representation – a single file, or a complex assemblage of files, code and software that must work together to activate an item. This can make decision making around selection and appraisal knottier than the more direct understandings and decisions that can be made about analogue material.
Cal talked us through a number of tools that he has been involved in developing. He began with BitCurator, an operating system and tool set, built on open-source tools, that can support appraisal and selection. Additions to BitCurator include one that uses NLP to perform Named Entity Recognition on the digital archives being processed. Other tools included RATOM, which supports the review of emails for sensitivity and public access, and Carascap, a more focused sensitivity review tool.
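To make the email review step concrete, here is a minimal, purely illustrative sketch of a keyword-based first pass at sensitivity triage. The term list and scoring are hypothetical assumptions for this example; they are not RATOM's actual method, which is considerably more sophisticated.

```python
# Illustrative only: flag emails containing hypothetical sensitive terms
# so that a human reviewer sees them first. Not RATOM's real approach.
SENSITIVE_TERMS = {"confidential", "password", "medical", "home address"}

def triage_email(body: str) -> dict:
    """Flag an email body for human review if it contains sensitive terms."""
    text = body.lower()
    hits = sorted(term for term in SENSITIVE_TERMS if term in text)
    return {"needs_review": bool(hits), "matched_terms": hits}

result = triage_email("Please keep this CONFIDENTIAL: my password is hunter2.")
print(result)  # {'needs_review': True, 'matched_terms': ['confidential', 'password']}
```

Even a crude pass like this shows the workflow idea: the tool does not decide what is released, it prioritises what a human reviews, keeping the process transparent and open to challenge.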
Cal underlined that he and his collaborators are guided by the belief that appraisal and selection using AI/ML methods needs to be transparent and open for review by others.
Professor Richard Marciano – University of Maryland iSchool
Richard’s talk addressed the ways in which AI and a Computational Thinking framework can be applied to archival collections. His case study focused on Japanese-American Second World War incarceration camp records.
Richard explained how the project developed knowledge graphs from index cards, and in the process built ‘computational biographies’ of people who feature in the collection. The project ran OCR (optical character recognition) on the physical index cards held in the archive, ran named entity recognition processes to identify individuals, places and concepts, cleaned the data using the Open Refine tool, created further categories and built knowledge graphs from that.
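The pipeline Richard described – extract entities, clean them, then link them into a graph – can be sketched very simply. The people, places and relations below are invented stand-ins, not records from the actual collection, and a real project would use a proper graph store rather than a dictionary.

```python
# A minimal sketch of building a 'computational biography' style knowledge
# graph from entities already extracted (e.g. by OCR plus named entity
# recognition) and cleaned (e.g. in OpenRefine). All facts are invented.
from collections import defaultdict

# Each tuple is (person, relation, value).
facts = [
    ("Person A", "incarcerated_at", "Camp X"),
    ("Person A", "occupation", "teacher"),
    ("Person B", "incarcerated_at", "Camp X"),
]

graph = defaultdict(list)
for subject, relation, obj in facts:
    graph[subject].append((relation, obj))

def biography(person: str) -> list:
    """Return all facts recorded for one person across the collection."""
    return graph[person]

print(biography("Person A"))
# [('incarcerated_at', 'Camp X'), ('occupation', 'teacher')]
```

The point of the graph structure is that facts scattered across many index cards accumulate under one person, which is exactly what turns card-level data into a biography.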
Leslie Johnston – US National Archives and Records Administration (NARA)
Leslie rounded off the first session on using AI in review processes by calling for modest approaches to machine learning, claiming that we already use a lot of AI without really considering it AI. Incrementally adding more task-specific AI tools to existing workflows will be better in the long run than trying to implement a perfect, singular AI tool (which doesn't really exist!).
Leslie reminded us that optical character recognition (OCR) is a form of machine learning based on pattern recognition: it is trained using supervised learning and uses computer vision to recognise text and page layouts. Chatbots use natural language processing, and speech-to-text transcription uses a form of natural language understanding and processing.
Dr Andrew Dixon – SVGC
Andrew focused his talk on the work SVGC had done with the Foreign, Commonwealth and Development Office (FCDO) in designing and implementing their digital sensitivity review process. The project was done in partnership with a team called Cicero made up of a range of different industry partners, organisations and universities.
Andrew gave an overview of the steps taken during the project; from initial appraisal and selection, onto sensitivity review and then preparation for transfer to The National Archives UK (TNA). He explained how AI was helpful for a number of issues identified within the project, including the scale of data, dealing with duplications, increasing consistency and efficiency, and fatigue in people.
The talk went into more detail on the increased risk of mosaicking. Government records are often duplicated across different departments. Sensitivity reviews are conducted by each department before transfer to TNA, and context-specific sensitivities will differ for each department. It is therefore possible that two or more departments transfer the same document to TNA but with different passages redacted. This creates a risk, as comparing the documents could expose sensitive information. The project was interested in exploring further how counter-mosaicking can take place.
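A toy example makes the mosaicking risk vivid: when two departments release the same document with different redactions, simply merging the two copies can recover text neither intended to disclose. The document content below is invented for illustration.

```python
# Toy illustration of the mosaicking risk: combining two differently
# redacted copies of the same (invented) document recovers hidden text.
REDACTED = "[REDACTED]"

def combine(version_a: list, version_b: list) -> list:
    """Merge two token streams, preferring whichever copy is unredacted."""
    merged = []
    for a, b in zip(version_a, version_b):
        merged.append(a if a != REDACTED else b)
    return merged

dept_a = ["Meeting", "held", "at", REDACTED, "on", "4", "May"]
dept_b = ["Meeting", "held", "at", "Embassy", "on", REDACTED, REDACTED]

print(" ".join(combine(dept_a, dept_b)))
# Meeting held at Embassy on 4 May
```

Counter-mosaicking, in this light, means detecting that two transfers are copies of one document and reconciling their redactions before release rather than after.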
Leontien Talboom – University College London and The National Archives UK
Leontien said that access to digital records held in archives is not straightforward and that the provision of access was being held back by a number of constraints that her PhD was looking into.
Archives aim to provide access to 'everyone'. Before the internet, 'everyone' meant anyone able and willing to visit the archive in person. The internet has vastly expanded the scope of 'everyone' to potentially include anyone with an internet connection.
Leontien highlighted (as Cal Lee had earlier) that born-digital material is inherently different from physical material and can be made accessible in multiple ways. She made the point that born-digital materials are still processed with a human reader in mind; Leontien suggested that we should also be aware of the computer when thinking about access processes, and consider making collections available as processable data. Finally, Leontien explained how the internet and the nature of digital materials have changed users' expectations. People accustomed to searching the internet with Google will expect to search an archive's digital material with a similar tool. AI-literate researchers might wish to run their own computing power over archival collections; archives will have to decide whether they can allow that, and under what conditions or protections.
John Sheridan – The National Archives UK
John claimed that 'AI is eating software' and that archivists will need to get to grips with what that means for their practices and for the ways the collections they manage will be accessed and used. He explained how physical archives had benefited from a certain amount of friction, with users only able to physically access small sub-sections of a collection at any one moment. This meant that sensitive data scattered across a collection was harder for a researcher to come across. The potential for researchers to search and use AI on collections removes that friction and increases the risk of sensitive data in archives being identified by end users.
John used an example of a tool (Pixel Recursive) that could de-pixelise photos to explain that advanced tools exist that could challenge redaction and privacy mechanisms. The existence of such tools had to be borne in mind when considering what type of access to provide to digital materials.
John concluded that it would be important to gradate access to digital collections. Some records might be completely open and available online. Other records might only be accessible in the reading room. TNA had developed a tool called DiAGRAM to help archivists identify and manage risk.
Chris Day – The National Archives UK
Chris discussed the applications of machine learning to analogue collections that have not been digitised. Machine learning can be applied to collection metadata to make those collections more accessible. Chris had used topic modelling on the catalogue records of the General Board of Health (1848-1871), applying Latent Dirichlet Allocation (LDA) to highlight important words and concepts in the catalogue and make the whole collection more accessible.
Nicki Welch – The National Archives UK
Nicki spoke about TNA's plans to create an access service for the digital records transferred from government departments. Nicki is responsible for creating that service. She said that basic access arrangements were in place for some digital records, but there was definitely scope for improvement.
Over the next few years, the plan is to develop an access service in consultation with government departments. TNA would explore whether the provision of gradated access would lessen the potential risks that government departments perceive in transferring digital records.
This workshop was the second in the project. Our collaboration with the Cabinet Office continues, and we welcome contributions from more government professionals, archivists and academics who are interested in this project. This part of the project comes to an end in October, but a lot more is happening in this field. Check out the AEOLIAN and AURA projects for more information.