Tuesday 21 October 2014

Digital Preservation


I had the opportunity to attend the 11th iPRES conference held in Melbourne - the first time the conference had been held in the Southern Hemisphere! The digital preservation community is relatively small, so the conference, with 177 delegates, was well attended, and included staff from university libraries, state and national libraries, archives, museums, commercial vendors and technology developers. 46% of delegates were international, giving a wide range of expertise and experience. As a novice to the digital preservation space there was much to take in; however, a number of themes emerged across the various initiatives in the sector.


Why care about research data management?


Although slightly outside the 'scope' of digital preservation, research data management is still an important part of the process: without properly managed data, there is nothing for us to preserve. Ross Wilkinson, Director of the Australian National Data Service (ANDS), gave a keynote presentation on the reasons why researchers, and institutions, need to embrace research data management. These include compliance with the Australian Code for the Responsible Conduct of Research and with funding bodies, as well as the long-term management of researcher data. Additionally, sharing data can lead to data citation and increased collaboration. Citation rates, particularly if the collaboration is international, can increase by up to three times.

Institutions also need to embrace research data management, and need to start viewing research data as a research output rather than just a research by-product. Sharing data and making it available also supports the research ambitions, and in turn the research strategy, of the institution. The Vice-Chancellor of the University of Tasmania, Professor Peter Rathjen, has said that as reputation is very important to research institutions, and as libraries make substantial contributions to that reputation, libraries (being the experts on digital collections) should be supported in creating world-class data collections, which in turn can enhance an institution's reputation.


Changes to the scholarly communication model


Andrew Treloar, Director of Technology for ANDS, gave a presentation on changes to the scholarly communication process. The existing system of registration (journal submission), certification (peer review), awareness (discovery services) and archiving (libraries, publishers, and archiving services such as Portico) is changing. Much more of the scholarly process is now on the web and wholly digital, and includes not just publications but also datasets, slides, wikis, processes, workflows and logs, all packaged up and surrounded by metadata. This change has meant that the existing systems of scholarly communication have also shifted. Registration systems now include the likes of protein data banks and WikiPathways, certification systems include open peer review, awareness systems include open wikis and e-lab books, and archiving systems include institutional and data repositories (which, although technically not "preservation" systems, ANDS acknowledged are the best we have at the moment).

However this new scholarly communication system does pose a problem for citing sources. In the past, the majority of publications remained ‘findable’ but now cited webpages can change and cited datasets may disappear. Common web platforms are increasingly used for scholarship, such as wikis, Github, Twitter and Wordpress. Many of these have desirable characteristics such as versioning, time stamping and social embedding, but they record rather than archive. This is a problem as they capture critical elements of the scholarly record which will be lost over time.

There is a difference between the scholarly process (short term, write many/read many, no guarantees provided) and the scholarly record (longer term, write once/read many, attempts to provide guarantees). We need to start thinking about moving from recording the scholarly process to archiving the scholarly record.

Another presentation, by Herbert Van de Sompel of Los Alamos National Laboratory, looked at problems with referencing web content. At the same time as we are adding things to the web, we are losing things as well. A report into social media documentation following the Egyptian Revolution in 2011 found that 10% of references had disappeared from the web a year after the event.

There are two problems with referencing web content - link rot, where links stop working, and content drift, where linked content changes over time. A study shortly to be published by PLoS found that 15% of links in articles submitted in 2012 were already dead, with 35% dead after five years.
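Distinguishing the two failure modes can be automated if the citing system stores a fingerprint of the cited content at citation time. A minimal sketch in Python; the function names and workflow are illustrative, not taken from any of the tools discussed at the conference:

```python
import hashlib

def cited_hash(text):
    """Fingerprint the cited content at citation time, so that
    content drift can be detected later."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def check_reference(status_code, page_text, original_hash):
    """Classify a cited URL: 'rot' if it no longer resolves,
    'drift' if it resolves but the content has changed,
    'intact' otherwise."""
    if status_code >= 400:
        return "rot"      # link rot: the link stopped working
    if cited_hash(page_text) != original_hash:
        return "drift"    # content drift: the page changed
    return "intact"
```

In practice a checker would fetch each cited URL, pass the HTTP status and page text to `check_reference`, and report the classification.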

There are ways to preserve this content. An experimental solution for Zotero automatically sends a website to the Internet Archive when it is bookmarked in Zotero. The user then gets a link to the live website as well as a link to the archived version, along with the capture date, in Zotero. Other solutions are identifier systems such as DOIs; however, these are only as good as the agency or organisation managing them.
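The same pattern can be sketched against the Internet Archive's public endpoints: fetching `https://web.archive.org/save/<url>` asks the archive's Save Page Now service to capture a page, and the Wayback Machine availability API returns the closest stored snapshot. The helper names below are illustrative and are not part of Zotero:

```python
import json
from urllib.parse import urlencode

def save_request_url(url):
    """URL which, when fetched, asks the Internet Archive's
    Save Page Now service to capture the page."""
    return "https://web.archive.org/save/" + url

def availability_query(url, timestamp=None):
    """Query URL for the Wayback availability API; an optional
    YYYYMMDDhhmmss timestamp requests the snapshot closest to it."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)

def closest_snapshot(response_json):
    """Extract the archived snapshot URL from an availability API
    response, or None if no snapshot is available."""
    data = json.loads(response_json)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None
```

A citation tool could trigger a capture via `save_request_url` and then record the snapshot URL and date returned by the availability API alongside the live link.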


Preservation processes



Preservation Policies

"Without a policy framework a digital library is little more than a container for content". Preservation policies serve several important functions, one of which is to inform stakeholders about the digital archive and provide transparency about the approaches taken to preservation. They also enhance the "trustworthiness" of the archive. However, most institutions either do not have a preservation policy or have one so out of date that it is effectively useless.

Barbara Sierman, of the National Library of the Netherlands, reported on the European SCAPE project, which looked at, amongst other things, what is required of a preservation policy. Using the few resources available, the project developed a catalogue of policy elements as a guideline to help institutions improve their preservation policies. A maturity model was also introduced to score how mature an institution's policy framework is.


Collection Profiling

Maureen Pennock, Head of Digital Preservation at the British Library, outlined their process for collection profiling, which is a precursor to collection preservation. It is important as it defines the type of content being held and what needs to be preserved. It is also important for looking at what shouldn't be preserved – disposal is just as important as preservation.

When collection profiling, a number of factors are examined, including a summary of the content, acquisition methods and formats, preservation intent, issues with preservation, and any sub-collections and different representations of collections. An example was given using the British Library web archives. Once collection profiling is completed, file format assessment and preservation profiling can begin.


File Format Assessment

The concepts of file format endangerment and obsolescence are important when considering digital preservation activities. According to Heather Ryan, Assistant Professor at the University of Denver, file format endangerment describes the possibility that information stored in a particular file format will not be interpretable or renderable using standard methods within a certain timeframe, whereas file format obsolescence occurs when information stored in a particular file format is no longer accessible using current technologies.

File format assessments analyse the risks associated with using a particular file format. The British Library has analysed a number of file formats, looking at characteristics such as development status, adoption, software support, complexity, external dependencies, technical protection mechanisms and legal issues. These are used in conjunction with other preservation activities to determine the endangerment level of particular file formats. Examples were given using TIFF, JP2 and PDF. Researchers at the University of Denver have also identified three key factors to consider for file format endangerment: availability of rendering software, specifications, and community/third-party support.
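Such characteristics can feed into an overall endangerment level in many ways; one simple approach is a weighted risk score. The weights and the TIFF ratings below are purely hypothetical, using the three University of Denver factors as an example:

```python
# Hypothetical weights for the three endangerment factors named above.
WEIGHTS = {
    "rendering_software": 0.4,  # availability of software that renders the format
    "specifications": 0.3,      # openly published, complete specifications
    "community_support": 0.3,   # community / third-party support
}

def endangerment_score(ratings):
    """Combine per-factor risk ratings (0.0 = safe, 1.0 = endangered)
    into a single weighted score between 0 and 1."""
    return sum(WEIGHTS[factor] * ratings[factor] for factor in WEIGHTS)

# A widely supported format such as TIFF might rate low on every factor:
tiff = {"rendering_software": 0.1, "specifications": 0.0, "community_support": 0.1}
```

A real assessment would of course weigh far more characteristics, and the ratings themselves are the hard part; the point is only that the judgement can be made repeatable and comparable across formats.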


Technical Registries

Technical registries are used in digital preservation to enable organisations to maintain definitions of the formats, format properties, software, migration pathways and so on needed to preserve content over the long term. There are numerous technical registries around at the moment, with more currently being developed. Among the most commonly used are PRONOM (the registry behind the DROID format identification tool) and Freebase.

One problem with most registries is that they are built on fixed information models and are difficult to evolve. Nor do all registries describe the same thing: each has different use cases and technical requirements. Each institution therefore needs to evaluate its own requirements and carry out its own risk analysis of its digital preservation practices.

Two new registries have been created by Preservica and the National and State Libraries of Australasia (NSLA). Preservica has released its Linked File Format Registry along with a new version of its software. Through peer-to-peer collaboration, the registry can be added to and edited, with each installation choosing which changes they wish to incorporate into their own instance of the registry – a first of its kind 'linked' data registry. The NSLA is also funding a project to create a Digital Preservation Technical Registry to collate the information from the various registries into one place.

There is interest in automating as much of the digital preservation workflow as possible. Data Archiving and Networked Services (DANS) in the Netherlands has developed a system called Epimenides that can check whether a newly ingested file is in an acceptable or preferred format, check whether it can be migrated to another format, and perform any necessary migrations automatically, depending on rules set up in the system.
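The rule-driven part of such a workflow can be sketched as a lookup table keyed on format identifiers (PRONOM PUIDs here); the rules, pathways and policy labels are illustrative, not Epimenides' actual configuration:

```python
# Each rule maps an identified format to a policy decision and,
# where migration is required, a target format (hypothetical pathways).
RULES = {
    "x-fmt/111": ("preferred", None),   # plain text: keep as-is
    "fmt/44": ("acceptable", None),     # JPEG (JFIF 1.02): keep as-is
    "fmt/3": ("migrate", "fmt/13"),     # GIF 87a -> PNG (illustrative pathway)
}

def on_ingest(puid):
    """Decide what to do with a newly ingested file, by PRONOM PUID."""
    status, target = RULES.get(puid, ("unknown", None))
    if status == "migrate":
        return "migrate {} -> {}".format(puid, target)
    if status in ("acceptable", "preferred"):
        return "keep {} ({})".format(puid, status)
    return "flag {} for manual review".format(puid)
```

A real system would run format identification first (e.g. with DROID) to obtain the PUID, then invoke the matching migration tool for any "migrate" decision.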


Preservation and digital repositories


Tomasz Miksa, from SBA Research in Austria, presented on risk assessments for digital repositories. Digital preservation of a repository system should concern not only the content but also the workflows, software and metadata: if the system cannot render or access the data, the repository is useless. Both technical aspects and organisational needs should be considered during risk assessments. Repository systems may need to undergo several digital preservation actions in order to preserve both the system and its workflows, and dependence on external services and insufficient documentation are key risks to those actions.

The Danish State and University Library reported on a self-assessment of their digital repository, undertaken in order to expose the drivers for digital preservation, improve staff and management understanding of digital preservation challenges, and enable benchmarking with other digital preservation organisations.


Conclusions


Digital preservation is still largely ad hoc in many institutions, with confusion about exactly what it is. Collaboration amongst the digital preservation community was a key message throughout the conference, although, as one delegate noted, there are currently multiple file format registries being worked on; so while everyone seemingly agrees that collaboration is key, it remains to be seen whether it will be put into practice.

And for those institutions currently facing an uphill battle to show management the value of digital preservation, one suggestion was to mark the records or files in an organisation that are more than five years old as unreadable, and see what sort of outcry there is when people try to access them.

Image credit: http://lts2.evault.com/homepage/digital-preservation/ 
